Scope of the Project:
The project aims to analyze and derive insights from the ERA5 (ECMWF Reanalysis version 5) weather dataset. It involves both regression and classification analyses to model and predict various weather parameters and to gain insights into correlations, causations, and patterns among these parameters1. The ERA5 dataset, provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), is a high-quality global atmospheric reanalysis dataset covering multiple decades2. The analysis focuses on the region of Bancroft in Ontario, Canada, for the years 2015 to 2022.
Exploring the Atmosphere Through Data
Weather, a phenomenon as old as Earth itself, has long been a subject of human fascination. The complex interplay of temperature, wind, and precipitation shapes our environment, affects our lives, and challenges our understanding of the natural world3. With the advent of technology and data analysis techniques, a deeper exploration of these complex processes is now possible4. The project, titled "Weather Data Analysis: A Regression and Classification Approach on the ERA5 Dataset", aims to contribute to this ongoing exploration by examining how different variables interact to create complex weather phenomena.
Geographic Focus: Bancroft, Canada
While this project leverages the global ERA5 dataset, Bancroft, Ontario, Canada, is chosen as the specific focus for an in-depth analysis. The geographic location of Bancroft can be seen in picture 1. This decision is rooted in the unique climatic and meteorological characteristics of the region, which offer a distinctive case study for understanding complex weather phenomena. The geographical and climatic characteristics of Bancroft, strongly influenced by the so-called 'lake effect', make it an excellent case study for analyzing the relationships among different atmospheric elements; the area is also of particular interest to IBM and IBM clients5. The work presented is limited to the years 2015 to 2022 in order to capture only current and recent climate developments and to avoid distorting the results with strongly deviating observations from the more distant past. Focusing on Bancroft enables a targeted application of regression and classification techniques, facilitating a detailed analysis of localized weather patterns and their impacts6.
Picture 1: Complete Map of Ontario in Canada, containing the relevant regions for the analysis (Bancroft in the centered southeast). Copyright: IBM Deutschland GmbH
The Impetus for Investigation
The motivation for this project is rooted in historical interest and present-day necessity. As discussed in a systematic review titled "The contribution of weather forecast information to agriculture, water, and energy sectors in East and West Africa," accurate weather prediction is crucial for effective agriculture, disaster management, and urban planning, particularly in the context of climate change risks7. An example of the complexity of weather systems is the lake effect, explained in "A Hybrid Dataset of Historical Cool-Season Lake Effects From the Eastern Great Lakes of North America": a phenomenon in which cold winds moving over warmer lake waters lead to significant changes in nearby weather patterns8. The European Centre for Medium-Range Weather Forecasts (ECMWF) and its ERA5 dataset provide a great opportunity to analyze detailed weather parameters over several decades. This data, examined for its comprehensiveness in the "Evaluation of spatial-temporal variation performance of ERA5 precipitation data in China," is a valuable resource for researchers seeking to understand weather patterns and atmospheric dynamics9.
The research is guided by four pivotal questions, addressed through regression and classification analyses.
Research questions:
Regression Analysis: Is it possible to accurately predict the temperature and its relationship with wind characteristics using historical data? This question is inspired by the current understanding of atmospheric variables and their interactions and is specified as follows:
Temperature Prediction: Is it possible to build an accurate regression model to predict temperature based on historical data? This involves exploring the relationships between temperature and windspeed over time (time of day, day, season, etc.). The result of this regression analysis is the identification of a trend in the weather data as well as the prediction of the temperature for the next x days, weeks, etc. based on the historical weather.
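As an illustration of what such a regression could look like, the following sketch fits a linear model with sine/cosine features to a synthetic diurnal temperature cycle. All numbers and column names here are invented stand-ins, not the actual Bancroft data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the hourly data: a diurnal temperature cycle in
# Kelvin with noise (values are invented, not the Bancroft observations).
rng = np.random.default_rng(42)
n = 500
hour = np.arange(n) % 24
temp = 283 + 5 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 0.5, n)
df = pd.DataFrame({"hour": hour, "avg_temp": temp})

# Encode the diurnal cycle with sine/cosine features and fit a linear model.
X = np.column_stack([np.sin(2 * np.pi * df["hour"] / 24),
                     np.cos(2 * np.pi * df["hour"] / 24)])
model = LinearRegression().fit(X, df["avg_temp"])
r2 = model.score(X, df["avg_temp"])
```

On real data, lagged temperature and calendar features (day, season) would be added in the same way as extra columns of `X`.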
Temperature and Wind Modeling: Is it possible to find a correlation or causation between the temperature and windspeed, windgust or winddirection using regression techniques? This involves investigating the impact of the independent variables (wind features) on the temperature. The result of this regression analysis is the identification of the correlation or causation between wind features and the temperature based on the historical weather. As an outlook, it could also be analyzed whether certain winddirections result in a certain type of weather event using logistic regression or classification.
Multivariate Temperature Prediction / Linear Regression of Temperature with Multiple Predictors: How does the incorporation of multiple atmospheric predictors, such as windspeed, windgust, winddirection, air pressure, and snow and ice parameters, enhance the accuracy of temperature prediction compared to a model based solely on windspeed? This question delves into the complex interplay of various atmospheric variables and their combined effect on temperature changes over different time scales (time of day, day, season, etc.). The outcome of this analysis will contribute to a nuanced model that considers a spectrum of atmospheric conditions for more precise temperature forecasting. This task also includes using the regression model to uncover interactions and synergies among different atmospheric predictors in influencing temperature variations, seeking to identify potential non-linear relationships or combined effects of multiple predictors on temperature. Understanding these interactions is crucial for refining the predictive model and gaining insights into the intricate dynamics of the atmosphere. Furthermore, temporal dynamics are to be analyzed: how does the temporal aspect, including diurnal patterns and seasonal variations, impact the relationship between temperature and multiple predictors? This question aims to capture the temporal dynamics of the atmosphere and assess how the predictive power of the model varies across different timeframes. Examining temporal patterns will contribute to a more nuanced understanding of how atmospheric predictors influence temperature fluctuations under varying conditions.
Extreme Weather Event Prediction by Temperature in Logistic Regression: Can logistic regression effectively classify and predict the occurrence of extreme or normal weather events based on temperature (or alternatively windspeed) ranges? This question involves categorizing temperature into distinct classes and determining whether certain temperature ranges are indicative of extreme weather events. The outcome of this analysis will indicate whether an observation is an extreme weather event, providing valuable insights into the correlation between temperature and specific meteorological phenomena. To achieve this, this task also includes finding out what the critical temperature thresholds associated with the onset or intensification of different weather events are, i.e., identifying temperature breakpoints that signify transitions between different weather conditions. Understanding these thresholds is essential for developing early warning systems and enhancing preparedness for weather-related events.
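A minimal sketch of such a logistic regression, on synthetic temperatures with a hypothetical −15 °C "extreme" threshold (the real thresholds are precisely what the analysis is meant to discover):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic temperatures in Celsius; an observation is labelled "extreme"
# below an invented threshold of -15 degrees.
temp = rng.uniform(-30, 30, 1000).reshape(-1, 1)
extreme = (temp.ravel() < -15).astype(int)

clf = LogisticRegression().fit(temp, extreme)
acc = clf.score(temp, extreme)

# The fitted decision boundary (predicted probability 0.5) recovers the
# temperature threshold separating "extreme" from "normal" observations.
boundary = -clf.intercept_[0] / clf.coef_[0, 0]
```

Reading the threshold off the fitted coefficients, as in the last line, is one simple way to estimate the temperature breakpoints the research question asks about.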
Classification Analysis: Can effective classification and prediction of weather events, including extreme occurrences, be achieved based on multivariate weather data? This aspect of the research aims to develop methods for accurate prediction of weather events, acknowledging the complexity and variability of weather patterns and is specified as follows:
Extreme Weather Events in Binary Classification: Is it possible to classify and predict extreme weather events such as storms? This involves training a binary classification model to identify patterns indicative of extreme events. The result of this classification analysis is the prediction of extreme weather events based on the current weather data and a model that was trained on historical weather data.
Weather Event and Pattern Classification in Multiclass Classification: Is it possible to categorize and predict different extreme weather events based on multivariate weather data? This involves using multiclass classification algorithms. The result of this classification analysis is the prediction of certain weather events based on the current weather data and a model that was trained on historical weather data.
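A sketch of such a multiclass setup, using a random forest on synthetic multivariate features with toy labels that loosely mimic the dataset's wep categories (distributions, thresholds, and label rules are all invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600
# Synthetic multivariate features; distributions and thresholds are invented.
temp = rng.normal(278, 10, n)        # temperature in Kelvin
wind = rng.gamma(2.0, 2.0, n)        # wind speed
snow = rng.exponential(0.5, n)       # cumulative snow

# Toy event labels loosely mimicking the dataset's "wep" categories.
labels = np.where(snow > 1.5, "Heavy Snowfall",
                  np.where(wind > 8, "Storm", "Blue sky day"))

X = np.column_stack([temp, wind, snow])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The held-out split mirrors the intended workflow: a model trained on historical observations is evaluated on data it has not seen.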
The hypotheses are twofold, corresponding to the project's dual analytical approach:
In Regression: There exists a significant correlation between temperature and wind characteristics, which can be modeled to predict future temperature trends and variations. This hypothesis is based on the premise that atmospheric variables are interconnected and can be analyzed to forecast weather conditions.
In Classification: Specific patterns in the weather data can accurately predict various weather events, including extreme conditions. This hypothesis is informed by the need for effective prediction models in the face of increasingly frequent and severe weather events.
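As a first, minimal check of the regression hypothesis, the correlation between temperature and windspeed can be computed directly. The sketch below uses synthetic stand-in data with an invented relationship; on the real data the same call runs against the corresponding df_full columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 1000
# Invented stand-ins: a windspeed series and a temperature that partly
# depends on it (the true relationship is what the project investigates).
windspd = rng.gamma(2.0, 2.0, n)
temp = 280 - 0.8 * windspd + rng.normal(0, 2, n)
df = pd.DataFrame({"avg_temp": temp, "avg_windspd": windspd})

# Pearson correlation as a first check before fitting regression models.
r = df["avg_temp"].corr(df["avg_windspd"])
```

A correlation alone does not establish causation, which is why the hypothesis is tested further with regression models rather than this single statistic.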
de Lima, Glauston, R.T. / Stephan, S. (2013): A new classification approach for detecting severe weather patterns, in: Computers & Geosciences 57 (2013): 158-165.↩
ECMWF (2023a): ERA5: data documentation. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation↩
Liljequist, G.H. / Cehak, K. (1984): Allgemeine Meteorologie. 3. Auflage, Springer-Verlag.↩
Fathi, M. / Haghi Kashani, M. / Jameii, S. M. / Mahdipour, E. (2022): Big Data Analytics in Weather Forecasting: A Systematic Review, in: Archives of Computational Methods in Engineering 29.2 (2022, Springer): 1247–1275↩
Hjelmfelt, M.R. (1990): Numerical study of the influence of environmental conditions on lake-effect snowstorms over Lake Michigan, in: Monthly Weather Review, 118(1), pp.138-150.↩
Ghirardelli, J.E. (2005): An Overview of the Redeveloped Localized Aviation Mos Program (Lamp) For Short-Range Forecasting.↩
The contribution of weather forecast information to agriculture, water, and energy sectors in East and West Africa↩
A Hybrid Dataset of Historical Cool-Season Lake Effects From the Eastern Great Lakes of North America↩
Evaluation of spatial-temporal variation performance of ERA5 precipitation data in China↩
This project employs a range of statistical and machine learning techniques to analyze the ERA5 dataset. The outcomes of this research are expected to not only enhance meteorological understanding but also provide practical tools for weather prediction, with implications for environmental policy and public safety. By harnessing the power of the ERA5 dataset 1 and advanced analytical techniques 2, this project aims to contribute valuable insights to the field of meteorology and support informed decision-making in a world increasingly affected by weather-related challenges.
This project uses a multifaceted approach, primarily utilizing Design Science Research (DSR)3 in line with Hesse's framework4. This methodological framework focuses on the creation and critical evaluation of artifacts to address specific problems. In this DSR approach, complex weather phenomena are identified as the problems to be addressed. Additionally, iterative prototyping5 is employed, enabling systematic refinement of models and methods based on continuous evaluation and integration of data-driven insights6. This project combines the DSR cycle by Gregor / Hevner with iterative prototyping by Wilde / Hess and Goldman / Narayanaswamy. This integration fosters a dynamic environment where each prototype's development and evaluation progressively inform subsequent cycles of design and analysis. The result is a cycle of artifact creation (in this case, models and algorithms) specifically tailored to analyzing weather patterns in Bancroft using the ERA5 dataset. These models are refined continuously through iterative prototyping, where each iteration's outcomes inform the next cycle, ensuring they become increasingly effective and accurate. The artifacts are then rigorously evaluated against the research questions; the results of every iteration are evaluated using cross-validation as described by Shao (1993) and Browne (2000)7. This process is enriched by a comprehensive literature review following Webster / Watson8, conducted before and during the implementation, ensuring the methods and analyses remain aligned with current meteorological and data science advancements. The combination of DSR, iterative prototyping, cross-validation and literature research forms the foundation of this approach, ensuring a thorough and robust analysis that is well-suited to address the complexities of atmospheric data analysis.
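The cross-validation step of the evaluation can be sketched as follows, here with scikit-learn's k-fold utilities on synthetic data (the real evaluation runs against the project's models and the ERA5 sample):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
# Invented stand-ins for predictors (e.g. wind features) and a response.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -0.7, 0.3]) + rng.normal(0, 0.2, 200)

# 5-fold cross-validation: each fold's model is scored on held-out data,
# so every observation is used for validation exactly once.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()
```

Averaging the per-fold scores gives a more robust estimate of generalization than a single train/test split, which is why each prototyping iteration is evaluated this way.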
ECMWF (2023b): ERA5: reanalysis datasets for forecasts. URL: https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5↩
Fathi, M. / Haghi Kashani, M. / Jameii, S. M. / Mahdipour, E. (2022): Big Data Analytics in Weather Forecasting: A Systematic Review, in: Archives of Computational Methods in Engineering 29.2 (2022, Springer): 1247–1275↩
Gregor, S. / Hevner, A.R. (2013): Positioning and Presenting Design Science Research for Maximum Impact, in: MIS Quarterly, Jg. 37, Nr. 2, S. 337-355; Hevner, A. / Chatterjee, S. (2010): Design Research in Information Systems, Theory and Practice. Hrsg. von R. Sharda/S. Voß. Bd. 22. Integrated Series in Information Systems. New York, NY, USA: Springer New York, NY.; Hevner, A. / March, S.T. / Park, J. / Ram, S. (2004): Design Science in Information Systems Research, in: MIS Quarterly 28.1, S. 75–105.↩
Wilde, T. and Hess, T., 2007. Forschungsmethoden der wirtschaftsinformatik. Wirtschaftsinformatik, 4(49), pp.280-287.; Goldman, N. and Narayanaswamy, K., 1992, June. Software evolution through iterative prototyping. In Proceedings of the 14th international conference on Software engineering (pp. 158-172).↩
Reflective physical prototyping through integrated design, test, and analysis↩
Shao, J., 1993. Linear model selection by cross-validation. Journal of the American statistical Association, pp.486-494.; Browne, M.W., 2000. Cross-validation methods. Journal of mathematical psychology, 44(1), pp.108-132.↩
Webster, J. / Watson, R.T. (2002): Analyzing the past to prepare for the future: Writing a literature review, in: MIS quarterly. Jun 1: xiii-xiii.↩
There are two code blocks with imports in the notebook. The imports in import1 (the code below) encompass all the imports needed in chapters 3 and 4. At the beginning of chapter 5, there is a second import block (import2), which consolidates the imports specific to chapter 5. If an import is only needed once, it is included at the place where it is used.
#import basic dependencies (import1)
import pandas as pd
import numpy as np
import datetime
import math
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import warnings
#warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")
df_full = pd.read_csv("../data/external/feature_data_substation_bancroft_labelled.csv")
df_full.head()
| Unnamed: 0 | substation | run_datetime | valid_datetime | horizon | avg_temp | avg_windspd | avg_windgust | avg_pressure_change | avg_snow | ... | MA_avg_winddir_month | MA_avg_temp_change_month | label0 | label1 | label2 | label3 | wep | storm_id2 | year | month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Bancroft | 2015-07-15 00:00:00 | 2015-07-15 00:00:00 | 0.0 | 287.389224 | 3.386380 | 11.136197 | 52.892217 | 0.0 | ... | 80.302464 | NaN | 0 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 1 | 1 | Bancroft | 2015-07-15 01:00:00 | 2015-07-15 01:00:00 | 0.0 | 287.378997 | 3.326687 | 11.002795 | 50.256685 | 0.0 | ... | 78.584418 | -0.010227 | 0 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 2 | 2 | Bancroft | 2015-07-15 02:00:00 | 2015-07-15 02:00:00 | 0.0 | 287.388845 | 3.243494 | 10.700595 | 47.944054 | 0.0 | ... | 77.809235 | -0.000189 | 3 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 3 | 3 | Bancroft | 2015-07-15 03:00:00 | 2015-07-15 03:00:00 | 0.0 | 287.427324 | 3.145505 | 10.323983 | 45.855264 | 0.0 | ... | 77.931830 | 0.012700 | 2 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
| 4 | 4 | Bancroft | 2015-07-15 04:00:00 | 2015-07-15 04:00:00 | 0.0 | 287.489158 | 3.047607 | 9.921157 | 44.823453 | 0.0 | ... | 79.272034 | 0.024983 | 2 | 1 | NaN | 1 | Blue sky day | NaN | 2015 | 7 |
5 rows × 184 columns
The ERA5 dataset is provided by the European Centre for Medium-Range Weather Forecasts (ECMWF). It is a product of atmospheric reanalysis, a process that involves assimilating observational data from various sources into a numerical weather prediction model.
Temporal Aspect: The dataset provides temporal information at various intervals (e.g., hourly or monthly), allowing for both short-term and long-term analyses.
Spatial Aspect: With a spatial resolution of around 31 km, the dataset provides comprehensive global coverage, facilitating region-specific studies. In this project, a geographical subset for the region of Bancroft in Ontario, Canada is chosen (as this area is of particular interest for IBM).
Multivariate Nature: Multiple meteorological parameters are available, enabling a comprehensive analysis of weather patterns.
Quality: ERA5 is known for its high quality and precision, making it suitable for detailed analyses and modeling.
The ERA5 dataset is comprehensive, providing a wide range of meteorological variables that cover various aspects of the Earth's atmosphere. The observations and general characteristics measured in the ERA5 dataset are crucial for understanding and analyzing weather and climate patterns. Some of the key variables included in the dataset are1:
Temperature: ERA5 provides information on air temperature at different levels of the atmosphere. This includes 2-meter temperature, which is often used as a representative of surface temperature.
Wind: Wind speed, gust and direction are essential meteorological parameters. ERA5 includes information on both zonal (east-west) and meridional (north-south) components of the wind at different altitudes.
Precipitation: Precipitation data includes rainfall and snowfall. This information is crucial for understanding water cycles and is vital for various applications, including hydrology and agriculture.
Pressure: Atmospheric pressure at different levels is provided in the dataset. Changes in atmospheric pressure are associated with weather patterns and can influence local weather conditions.
Humidity: Relative humidity and specific humidity are included, providing insights into the moisture content of the atmosphere. This is important for understanding cloud formation and precipitation processes.
Snow and ice: Snow density as well as cumulative snow and ice are included.
Other Atmospheric Parameters
These variables collectively provide a detailed snapshot of the Earth's atmosphere and are fundamental for conducting both regression and classification analyses. Researchers and meteorologists often leverage these variables to study climate trends, weather patterns, and environmental changes. The high spatial and temporal resolution of the ERA5 dataset enhances its utility for diverse applications, including climate research, environmental monitoring, and weather forecasting.
The data population of the ERA5 dataset refers to the entirety of the atmospheric observations encompassed by the ERA5 dataset. It includes all global weather data recorded and simulated by the ERA5 reanalysis project. The population covers a vast range of geographical locations, spanning the entire globe, and spans several decades, as reanalysis data provides long-term climate records 2.
The sample used for this project is a subset of the data population that is extracted and analyzed for this specific study and investigation. The sample is tailored based on temporal, spatial, or variable-specific criteria. In this instance, the sample corresponds to the region of Bancroft in Ontario, Canada, for the timespan of 2015 to 2022. For this region and timespan the data is holistic and focuses on the meteorological parameters stated under 7. Working with this specific sample makes it possible to address the specific research questions, conduct exploratory analyses, and create targeted visualizations for only the area of interest, while also saving computing resources by not computing on redundant data and not distorting the analysis with data irrelevant to the research goal.
In the following code snippets the data will be transformed to be able to better visualize and analyse it. Some features might be recalculated (e.g., by calculating the average for every week for visualization).
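For example, a weekly average for visualization could be computed with pandas resampling. The sketch below uses a synthetic hourly series in place of the real df_full columns (index and values are invented):

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for valid_datetime / avg_temp
# (four weeks of invented data; the real frame is df_full).
idx = pd.date_range("2015-07-15", periods=24 * 28, freq="h")
rng = np.random.default_rng(3)
hourly = pd.DataFrame({"avg_temp": 287 + rng.normal(0, 1, len(idx))},
                      index=idx)

# Weekly mean temperature, e.g. for a smoother visualization.
weekly = hourly["avg_temp"].resample("W").mean()
```

On the real data, the datetime column would first be set as the index (e.g. with set_index on valid_datetime) before resampling.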
At first, the data is loaded from the .csv file and the head of the data is printed to get a first look at the columns (variables) and rows (observations). The dataset consists of 65345 observations and 184 columns, including the useful and unique predictor variables and the reasonable response variables mentioned in the chapter "Variables / parameters", the identifier variable "Unnamed: 0" (which might previously have been used as an index and will be dropped for the further analysis because of redundancy), and run_datetime and valid_datetime (which contain the datetime of the observation and are not only identifier variables but can also be used in the analysis, e.g., for time series analysis). In a next step, the distribution and statistics of the variables will be examined more closely by analyzing their quantiles, minima and maxima.
Collection Method 3:
Observational Data Assimilation: The ERA5 dataset is generated through a process called reanalysis. This involves assimilating vast amounts of observational data from satellites, weather stations, and other sources into a numerical weather prediction model. This assimilation process helps create a consistent and comprehensive representation of the atmosphere over time.
Numerical Weather Prediction Model: ECMWF uses advanced numerical weather prediction models to simulate the Earth's atmosphere. These models take into account the laws of physics governing the atmosphere and use initial conditions based on observational data.
Temporal Coverage: The ERA5 dataset spans multiple decades, with ongoing updates. It covers from 1979 to near-real-time, providing a continuous record of atmospheric conditions. In this project the scope of the data is reduced to the dates of 2015 to 2022 for reasons of resources and project limitations.
Spatial Resolution: The dataset has a high spatial resolution, offering detailed information on a global scale, with grid points approximately 31 kilometers apart.
Purpose: The primary purpose of generating the ERA5 dataset is to provide a comprehensive and accurate representation of past weather conditions. It serves various scientific and operational applications, including climate monitoring, research, and supporting weather forecasting.
Quality Assurance: ECMWF is renowned for its commitment to data quality. The assimilation process involves rigorous validation against a wide range of observational data to ensure the accuracy and reliability of the reanalysis output.
Accessibility: The ECMWF makes ERA5 data available to the public and the scientific community, fostering research and applications in meteorology, climate science, and related fields.
ECMWF (2023c): ERA5: data documentation parameterlistings. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings↩
ECMWF (2023a): ERA5: data documentation. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation↩
ECMWF (2023a): ERA5: data documentation. URL: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation↩
print(df_full.info())
print(df_full.dtypes)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65345 entries, 0 to 65344
Columns: 184 entries, Unnamed: 0 to month
dtypes: float64(166), int64(8), object(10)
memory usage: 91.7+ MB
None
Unnamed: 0 int64
substation object
run_datetime object
valid_datetime object
horizon float64
...
label3 int64
wep object
storm_id2 object
year int64
month int64
Length: 184, dtype: object
# Change datetime columns to datetime format
df_full['run_datetime'] = pd.to_datetime(df_full['run_datetime'])
df_full['valid_datetime'] = pd.to_datetime(df_full['valid_datetime'])
# Add temperature column for degree in celsius
df_full['avg_temp_celsius'] = df_full['avg_temp'] - 273.15
# Translate wind directions into cardinal directions
def wind_direction_label(degrees):
if 22.5 < degrees <= 67.5:
return 'Northeast'
elif 67.5 < degrees <= 112.5:
return 'East'
elif 112.5 < degrees <= 157.5:
return 'Southeast'
elif 157.5 < degrees <= 202.5:
return 'South'
elif 202.5 < degrees <= 247.5:
return 'Southwest'
elif 247.5 < degrees <= 292.5:
return 'West'
elif 292.5 < degrees <= 337.5:
return 'Northwest'
else:
return 'North'
df_full['wind_direction_label'] = df_full['avg_winddir'].apply(wind_direction_label)
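A few spot checks of the sector mapping (the function is restated compactly here so the snippet runs standalone): each 45-degree sector maps to one cardinal or intercardinal direction, with North covering the wrap-around.

```python
# Compact restatement of wind_direction_label for standalone spot checks.
def wind_direction_label(degrees):
    sectors = [(22.5, 67.5, 'Northeast'), (67.5, 112.5, 'East'),
               (112.5, 157.5, 'Southeast'), (157.5, 202.5, 'South'),
               (202.5, 247.5, 'Southwest'), (247.5, 292.5, 'West'),
               (292.5, 337.5, 'Northwest')]
    for low, high, name in sectors:
        if low < degrees <= high:
            return name
    return 'North'  # covers 0-22.5 and 337.5-360 degrees

print(wind_direction_label(45))   # → Northeast
print(wind_direction_label(180))  # → South
print(wind_direction_label(350))  # → North
```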
In the pursuit of extracting meaningful insights and constructing predictive models from a given dataset, it is imperative to establish a comprehensive understanding of the variables involved. This section delineates the variables within the dataset, providing a structured overview of each variable's key characteristics through a data dictionary. The data dictionary encapsulates essential information for each variable, facilitating a streamlined comprehension of their names, descriptions, roles, types, and formats.
While not explicitly used as predictors or responses in modeling, the inclusion of ID columns can offer valuable insights for data exploration and understanding. These columns serve as unique identifiers and are crucial for tracking and referencing specific records within the dataset.
The meticulous documentation provided in this variable list serves as a foundation for subsequent analyses, ensuring transparency, reproducibility, and informed decision-making throughout the data exploration and modeling processes.
data = {
'Name': [
'run_datetime',
'wep',
'avg_temp',
'min_wet_bulb_temp',
'avg_dewpoint',
'avg_temp_change',
'avg_windspd',
'max_windgust',
'avg_winddir',
'wind_direction_label',
'max_cumulative_precip',
'max_snow_density_6',
'max_cumulative_snow',
'max_cumulative_ice',
'avg_pressure_change'
],
'Description': [
'Date and time when the weather observations were recorded.',
'Weather Event Type (WEP) is a categorization of weather conditions based on specific parameters at a given location and time.',
'The average temperature measured at two meters above ground level, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Minimum wet bulb temperature recorded during the observation period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Average dewpoint temperature observed during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Average change in temperature during the observation period, obtained by calculating the difference between this observation and the following.',
'Average wind speed measured during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Maximum wind gust observed during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Average wind direction (in degree) observed during the recording period, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Wind direction (in cardinal direction) observed during the recording period, obtained by recalculating the avg_winddir parameter.',
'Maximum cumulative precipitation recorded, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Maximum snow density at a depth of 6 inches, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Maximum cumulative snow recorded, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Maximum cumulative ice recorded, considering all sensors, for the entire duration of one hour in the Bancroft region.',
'Average change in atmospheric pressure during the observation period, considering all sensors, for the entire duration of one hour in the Bancroft region.'
],
'Role': [
'ID / predictor',
'response',
'response / predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor',
'predictor'
],
'Type': [
'numerical continuous / ID',
'categorical nominal',
'numerical continuous',
'numerical continuous',
'numerical continuous',
'numerical continuous',
'numerical continuous',
'numerical continuous',
'numerical continuous',
'categorical ordinal',
'numerical continuous',
'numerical continuous',
'numerical continuous',
'numerical continuous',
'numerical continuous'
],
'Format': [
type(df_full['run_datetime'][0]),
type(df_full['wep'][0]),
type(df_full['avg_temp'][0]),
type(df_full['min_wet_bulb_temp'][0]),
type(df_full['avg_dewpoint'][0]),
type(df_full['avg_temp_change'][0]),
type(df_full['avg_windspd'][0]),
type(df_full['max_windgust'][0]),
type(df_full['avg_winddir'][0]),
type(df_full['wind_direction_label'][0]),
type(df_full['max_cumulative_precip'][0]),
type(df_full['max_snow_density_6'][0]),
type(df_full['max_cumulative_snow'][0]),
type(df_full['max_cumulative_ice'][0]),
type(df_full['avg_pressure_change'][0])
]
}
data_dict_df = pd.DataFrame(data)
# Display the data dictionary
data_dict_df
| Name | Description | Role | Type | Format | |
|---|---|---|---|---|---|
| 0 | run_datetime | Date and time when the weather observations we... | ID / predictor | numerical continuous / ID | <class 'pandas._libs.tslibs.timestamps.Timesta... |
| 1 | wep | Weather Event Type (WEP) is a categorization o... | response | categorical nominal | <class 'str'> |
| 2 | avg_temp | The average temperature measured at two meters... | response / predictor | numerical continuous | <class 'numpy.float64'> |
| 3 | min_wet_bulb_temp | Minimum wet bulb temperature recorded during t... | predictor | numerical continuous | <class 'numpy.float64'> |
| 4 | avg_dewpoint | Average dewpoint temperature observed during t... | predictor | numerical continuous | <class 'numpy.float64'> |
| 5 | avg_temp_change | Average change in temperature during the obser... | predictor | numerical continuous | <class 'numpy.float64'> |
| 6 | avg_windspd | Average wind speed measured during the recordi... | predictor | numerical continuous | <class 'numpy.float64'> |
| 7 | max_windgust | Maximum wind gust observed during the recordin... | predictor | numerical continuous | <class 'numpy.float64'> |
| 8 | avg_winddir | Average wind direction (in degree) observed du... | predictor | numerical continuous | <class 'numpy.float64'> |
| 9 | wind_direction_label | Wind direction (in cardinal direction) observe... | predictor | categorical ordinal | <class 'str'> |
| 10 | max_cumulative_precip | Maximum cumulative precipitation recorded, con... | predictor | numerical continuous | <class 'numpy.float64'> |
| 11 | max_snow_density_6 | Maximum snow density at a depth of 6 inches, c... | predictor | numerical continuous | <class 'numpy.float64'> |
| 12 | max_cumulative_snow | Maximum cumulative snow recorded, considering ... | predictor | numerical continuous | <class 'numpy.float64'> |
| 13 | max_cumulative_ice | Maximum cumulative ice recorded, considering a... | predictor | numerical continuous | <class 'numpy.float64'> |
| 14 | avg_pressure_change | Average change in atmospheric pressure during ... | predictor | numerical continuous | <class 'numpy.float64'> |
The temperature variables are numerical continuous variables, measured in units of Kelvin.
Temperature is a fundamental physical quantity: a measure of the average kinetic energy of the particles in a substance or system. As the temperature of a substance increases, its particles gain kinetic energy and move faster. The dataframe contains several different kinds of temperature.
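The dataframe also carries a Celsius counterpart (avg_temp_celsius) derived from the Kelvin column by subtracting 273.15. A minimal sketch of this conversion, using two sample values on the same scale as the data shown later:

```python
import pandas as pd

# Hypothetical sample values in Kelvin, on the same scale as the dataset
df = pd.DataFrame({'avg_temp': [287.389224, 264.241641]})

# Kelvin to Celsius: subtract 273.15
df['avg_temp_celsius'] = df['avg_temp'] - 273.15

print(df['avg_temp_celsius'].round(6).tolist())  # [14.239224, -8.908359]
```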
The weather event is a categorical nominal variable with the following values:
Blue Sky Day: A day with clear and cloudless skies. There are no significant weather phenomena such as rain, snow, or storms.1
Mild Snowfall: A light snowfall, where small amounts of snow fall, but not enough to be considered moderate or heavy snowfall.
Moderate Snowfall: A moderate snowfall, where larger amounts of snow fall compared to mild snowfall but not as much as heavy snowfall.
Moderate Rain: A moderate rainfall, where a moderate amount of precipitation occurs in the form of rain.
Heavy Snowfall with Accumulated Snow: Intense snowfall accompanied by a significant accumulation of snow on the ground.
Continuous Freezing Rain: Continuous freezing rain, where rain falls on frozen surfaces and immediately turns into ice.
Storm with Freezing Rain / Heavy Snow- and Icestorm: A storm with freezing rain or heavy snowfall, accompanied by the formation of ice. This can lead to hazardous conditions.
Snowstorm with High Precipitation: A snowstorm with intense snowfall and a high amount of precipitation in the form of snow.
However, weather analyses and predictions focus primarily on examining and anticipating extreme weather conditions. It is therefore not necessary to differentiate between degrees of cloudiness.
It is important to note that the designation "blue sky" encapsulates a range of commonplace weather situations. Consequently, even on a day characterized as a "blue sky" day there may be a degree of cloudiness.↩
This chapter delves into the crucial process of data splitting in machine learning, focusing on strategies for effectively partitioning datasets into training, testing, and validation sets. A hands-on implementation using the scikit-learn library is presented, highlighting the significance of proper data splitting for robust model evaluation. The chapter concludes with discussions on potential challenges and best practices 1.
Data splitting is a fundamental step in machine learning model development, playing a pivotal role in assessing model performance. The primary goal is to strike a balance between training a model that captures underlying patterns in the data and evaluating its generalization on unseen data. The need for data splitting arises from the desire to create models that generalize well to new, unseen instances. Overfitting, where a model memorizes training data but fails on new data, is a common concern. Conversely, underfitting occurs when a model is too simplistic to capture underlying patterns 2.
Before splitting the data into training, test and validation sets, the dataframe is reduced to only the needed columns to save resources.
Random Splitting: Randomly partitioning the data into training and testing sets is a straightforward strategy. This method is suitable for well-shuffled datasets, ensuring a representative sample for both training and testing.
Stratified Splitting: In classification tasks, maintaining the original class distribution is crucial. Stratified splitting ensures that the proportion of each class remains consistent in both training and testing sets, preventing bias.
Temporal Splitting: For time-series data, temporal splitting is essential. It involves training on earlier time periods and testing on later ones, mimicking real-world scenarios where models encounter new data sequentially.
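For an hourly series like this one, a temporal split amounts to cutting at a date so that the model is trained only on earlier observations. A minimal sketch with synthetic hourly data and a hypothetical cutoff date:

```python
import pandas as pd

# Hypothetical hourly index spanning the study period
times = pd.date_range('2015-07-15', '2022-12-27', freq='h')
df = pd.DataFrame({'run_datetime': times, 'value': range(len(times))})

# Train on earlier periods, test on later ones (hypothetical cutoff)
cutoff = pd.Timestamp('2021-01-01')
train = df[df['run_datetime'] < cutoff]
test = df[df['run_datetime'] >= cutoff]

# No test timestamp precedes any training timestamp
assert train['run_datetime'].max() < test['run_datetime'].min()
```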
Implementation with scikit-learn
The scikit-learn library provides a user-friendly interface for data splitting. The train_test_split function allows easy division of datasets into training and testing sets, with an optional validation set.
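Since train_test_split itself only produces two partitions, a validation set is commonly obtained by chaining two calls. A minimal sketch on synthetic, balanced data (the 80/10/10 proportions here are illustrative, not the split used later in this chapter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.repeat([0, 1], 500)  # two balanced classes

# First split off 20% as a holdout, then halve the holdout into test and validation
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 800 100 100
```

Stratifying both calls keeps the class proportions identical across all three sets.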
Overfitting and Underfitting: Care must be taken to strike a balance between overfitting and underfitting during model training. Regularization techniques and proper hyperparameter tuning can mitigate these issues.
Hyperparameter Tuning: Hyperparameter tuning is an ongoing process, and the performance of models may be sensitive to changes in hyperparameters. Robust evaluation using validation sets aids in identifying optimal hyperparameter configurations.
Conclusion Proper data splitting is foundational for developing reliable machine learning models. The presented strategies, coupled with the scikit-learn implementation, empower practitioners to create models that generalize well to new data. The chapter underscores the importance of thoughtful data splitting in the pursuit of effective machine learning models.
Scikit-learn (2023a): https://scikit-learn.org/stable/documentation.html↩
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.↩
#Creating a new dataframe with only the used columns (saves resources)
columns_to_keep = [
'run_datetime',
'wep',
'avg_temp',
'avg_temp_celsius',
'min_wet_bulb_temp',
'avg_dewpoint',
'avg_temp_change',
'avg_windspd',
'max_windgust',
'avg_winddir',
'avg_winddir_sin',
'avg_winddir_cos',
'wind_direction_label',
'max_cumulative_precip',
'max_snow_density_6',
'max_cumulative_snow',
'max_cumulative_ice',
'avg_pressure_change',
'label0', #additional wep labels for various blue sky events
'label1', #additional wep labels to differentiate between blue sky and extreme weather events
'label2'
]
# Use loc to select only the specified columns
df = df_full.loc[:, columns_to_keep]
df
| | run_datetime | wep | avg_temp | avg_temp_celsius | min_wet_bulb_temp | avg_dewpoint | avg_temp_change | avg_windspd | max_windgust | avg_winddir | ... | avg_winddir_cos | wind_direction_label | max_cumulative_precip | max_snow_density_6 | max_cumulative_snow | max_cumulative_ice | avg_pressure_change | label0 | label1 | label2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-07-15 00:00:00 | Blue sky day | 287.389224 | 14.239224 | 280.809506 | 280.735246 | NaN | 3.386380 | 14.899891 | 80.302464 | ... | 0.190676 | East | 2.009 | 0.0 | 0.000 | 0.0 | 52.892217 | 0 | 1 | NaN |
| 1 | 2015-07-15 01:00:00 | Blue sky day | 287.378997 | 14.228997 | 280.809506 | 280.414058 | -0.010227 | 3.326687 | 14.899891 | 76.866373 | ... | 0.102466 | East | 1.209 | 0.0 | 0.000 | 0.0 | 50.256685 | 0 | 1 | NaN |
| 2 | 2015-07-15 02:00:00 | Blue sky day | 287.388845 | 14.238845 | 280.809506 | 280.187074 | 0.009848 | 3.243494 | 14.899891 | 76.258867 | ... | 0.651950 | East | 0.400 | 0.0 | 0.000 | 0.0 | 47.944054 | 3 | 1 | NaN |
| 3 | 2015-07-15 03:00:00 | Blue sky day | 287.427324 | 14.277324 | 280.809506 | 280.049330 | 0.038479 | 3.145505 | 14.899891 | 78.299616 | ... | -0.971290 | East | 0.000 | 0.0 | 0.000 | 0.0 | 45.855264 | 2 | 1 | NaN |
| 4 | 2015-07-15 04:00:00 | Blue sky day | 287.489158 | 14.339158 | 280.809506 | 279.980697 | 0.061834 | 3.047607 | 14.702229 | 84.632852 | ... | -0.981976 | East | 0.000 | 0.0 | 0.000 | 0.0 | 44.823453 | 2 | 1 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65340 | 2022-12-27 04:00:00 | Moderate rain | 264.241641 | -8.908359 | 260.284794 | 262.061976 | -0.124561 | 1.962197 | 8.444256 | 232.606824 | ... | 0.991695 | Southwest | 2.126 | 0.0 | 25.643 | 0.0 | NaN | 5 | 0 | 3.0 |
| 65341 | 2022-12-27 05:00:00 | Blue sky day | 264.115391 | -9.034609 | 260.284794 | 262.114357 | -0.126250 | 1.978823 | 7.475906 | 229.938704 | ... | -0.823955 | Southwest | 2.226 | 0.0 | 21.161 | 0.0 | NaN | 5 | 1 | NaN |
| 65342 | 2022-12-27 06:00:00 | Blue sky day | 264.024853 | -9.125147 | 260.284794 | 262.206179 | -0.090537 | 2.005855 | 7.305549 | 227.024163 | ... | 0.675251 | Southwest | 2.426 | 0.0 | 16.430 | 0.0 | NaN | 5 | 1 | NaN |
| 65343 | 2022-12-27 07:00:00 | Blue sky day | 264.048368 | -9.101632 | 260.284794 | 262.350025 | 0.023514 | 2.040978 | 7.305549 | 223.900355 | ... | -0.662027 | Southwest | 2.826 | 0.0 | 10.859 | 0.0 | NaN | 5 | 1 | NaN |
| 65344 | 2022-12-27 08:00:00 | Blue sky day | 263.918722 | -9.231278 | 260.284794 | 262.512490 | -0.129646 | 2.078741 | 6.818578 | 220.894487 | ... | 0.554528 | Southwest | 3.426 | 0.0 | 5.640 | 0.0 | NaN | 0 | 1 | NaN |
65345 rows × 21 columns
from sklearn.model_selection import StratifiedShuffleSplit
# Initialize StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# Split the data using StratifiedShuffleSplit
for train_index, test_temp_index in sss.split(df, df['wep']):
    train_data = df.iloc[train_index]
    temp_data = df.iloc[test_temp_index]
# Further split temp_data into test and validation sets
# (reusing sss with test_size=0.2, so the validation set is 20% of temp_data)
for test_index, validation_index in sss.split(temp_data, temp_data['wep']):
    test_data = temp_data.iloc[test_index]
    validation_data = temp_data.iloc[validation_index]
# Display the shapes of the resulting sets
print("Train Data Shape:", train_data.shape)
print("Test Data Shape:", test_data.shape)
print("Validation Data Shape:", validation_data.shape)
# Display the distribution of labels 'wep'
print("\nDistribution of 'wep' in Train Data:")
print(train_data['wep'].value_counts())
print("\nDistribution of 'wep' in Test Data:")
print(test_data['wep'].value_counts())
print("\nDistribution of 'wep' in Validation Data:")
print(validation_data['wep'].value_counts())
Train Data Shape: (52276, 21)
Test Data Shape: (10455, 21)
Validation Data Shape: (2614, 21)

Distribution of 'wep' in Train Data:
wep
Blue sky day                                           42106
Mild snowfall                                           3598
Moderate snowfall                                       2336
Moderate rain                                           2104
Heavy snowfall with accumulated snow                     832
Frondurchlauf / Continuous freezing rain                 747
Storm with freezing rain / Heavy snow- and icestorm      344
Snowstorm with high precipitation                        209
Name: count, dtype: int64

Distribution of 'wep' in Test Data:
wep
Blue sky day                                            8421
Mild snowfall                                            719
Moderate snowfall                                        467
Moderate rain                                            421
Heavy snowfall with accumulated snow                     166
Frondurchlauf / Continuous freezing rain                 150
Storm with freezing rain / Heavy snow- and icestorm       69
Snowstorm with high precipitation                         42
Name: count, dtype: int64

Distribution of 'wep' in Validation Data:
wep
Blue sky day                                            2106
Mild snowfall                                            180
Moderate snowfall                                        117
Moderate rain                                            105
Heavy snowfall with accumulated snow                      42
Frondurchlauf / Continuous freezing rain                  37
Storm with freezing rain / Heavy snow- and icestorm       17
Snowstorm with high precipitation                         10
Name: count, dtype: int64
df.describe().T
| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| run_datetime | 65345 | 2019-04-06 14:09:11.362766848 | 2015-07-15 00:00:00 | 2017-05-25 23:00:00 | 2019-04-07 01:00:00 | 2021-02-14 16:00:00 | 2022-12-27 08:00:00 | NaN |
| avg_temp | 65345.0 | 279.574328 | 243.849393 | 271.114219 | 279.882735 | 289.903226 | 300.934144 | 11.383325 |
| avg_temp_celsius | 65345.0 | 6.424328 | -29.300607 | -2.035781 | 6.732735 | 16.753226 | 27.784144 | 11.383325 |
| min_wet_bulb_temp | 65345.0 | 273.601253 | 239.056465 | 265.232651 | 274.245761 | 284.128378 | 294.642287 | 11.914203 |
| avg_dewpoint | 65345.0 | 274.972135 | 238.429236 | 267.161062 | 275.429991 | 284.366793 | 294.97882 | 10.93508 |
| avg_temp_change | 65344.0 | -0.000361 | -1.168328 | -0.095711 | 0.012257 | 0.10334 | 1.104699 | 0.191027 |
| avg_windspd | 65345.0 | 2.440013 | 0.643393 | 1.882336 | 2.34648 | 2.893087 | 6.470306 | 0.77288 |
| max_windgust | 65345.0 | 12.659095 | 3.918856 | 9.966063 | 12.266102 | 14.895518 | 28.236084 | 3.721124 |
| avg_winddir | 65345.0 | 210.247462 | 21.17728 | 165.321173 | 222.129253 | 262.922675 | 346.805959 | 70.896648 |
| avg_winddir_sin | 65345.0 | -0.008188 | -1.0 | -0.715716 | -0.012884 | 0.699482 | 1.0 | 0.707159 |
| avg_winddir_cos | 65345.0 | 0.003304 | -1.0 | -0.704408 | 0.008696 | 0.708963 | 1.0 | 0.70701 |
| max_cumulative_precip | 65345.0 | 4.796664 | 0.0 | 0.1 | 1.528 | 6.2 | 83.471 | 7.64858 |
| max_snow_density_6 | 65345.0 | 21.123859 | 0.0 | 0.0 | 0.0 | 0.0 | 722.582771 | 67.439046 |
| max_cumulative_snow | 65345.0 | 17.454941 | 0.0 | 0.0 | 0.0 | 8.257 | 920.464 | 53.768892 |
| max_cumulative_ice | 65345.0 | 0.056319 | 0.0 | 0.0 | 0.0 | 0.0 | 9.955316 | 0.405855 |
| avg_pressure_change | 56672.0 | 0.017391 | -168.315674 | -19.263538 | -0.071299 | 19.549097 | 153.811455 | 33.88573 |
| label0 | 65345.0 | 2.027072 | 0.0 | 0.0 | 2.0 | 3.0 | 7.0 | 1.933712 |
| label1 | 65345.0 | 0.805463 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.395847 |
| label2 | 12712.0 | 3.06191 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 | 2.126446 |
After getting a closer overview of the data, some basic explorative data analysis will be conducted to visualize the mean, quantiles, shape, distribution, variance, etc. of the data and to analyze the correlation of the variables.
At first, the most relevant predictor and response variables are calculated (already done in previous work of IBM with the dataset) in order to visualize their development over time in a time series analysis. The temperature measurements average temperature (avg_temp), minimal wet bulb temperature (min_wet_bulb_temp) and average temperature change (avg_temp_change) are given in Kelvin. Measurements of wind are given as the average wind speed (avg_windspd), the maximal wind gust (max_windgust) and the average wind direction (avg_winddir). Weather events are measured as maximal cumulative precipitation (max_cumulative_precip), maximal snow density (max_snow_density_6), maximal cumulative snow (max_cumulative_snow), maximal cumulative ice (max_cumulative_ice) and the average pressure change (avg_pressure_change).
import datetime
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
rcParams['figure.figsize'] = 20, 15
sns.set(style="darkgrid")
parameters = [
('avg_temp', 'Kelvin'),
('min_wet_bulb_temp', 'Kelvin'),
('avg_dewpoint', 'Kelvin'),
('avg_temp_change', 'Kelvin'),
('avg_windspd', 'm/s'),
('max_windgust', 'm/s'),
('avg_winddir', 'degrees'),
('max_cumulative_precip', 'mm'),
('max_snow_density_6', 'kg/m^3'),
('max_cumulative_snow', 'mm'),
('max_cumulative_ice', 'mm'),
('avg_pressure_change', 'Pa')
]
def calculate_moving_averages(df, variables, time_periods):
    # Has already been done in previous work at IBM (by Julian Erath) for the dataset:
    # selected_variables = ['max_windgust', 'min_wet_bulb_temp', 'max_snow_density_6', 'max_cumulative_precip', 'max_cumulative_ice', 'max_cumulative_snow', 'avg_windspd', 'avg_pressure_change', 'avg_temp', 'avg_dewpoint', 'avg_winddir', 'avg_temp_change']
    # selected_time_periods_week = [168]
    # selected_time_periods_month = [720]
    for variable in variables:
        for period in time_periods:
            ma_column_name = f'MA_{variable}_{period}_hours'
            df[ma_column_name] = df[variable].rolling(period, min_periods=1).mean()
def plot_weather_parameters(ax, data, parameter, unit):
    ax.plot(data.run_datetime, data[parameter], label=f'{parameter} Actual')
    ax.plot(data.run_datetime, data[f"MA_{parameter}_week"], label=f'{parameter} Weekly MA')
    ax.plot(data.run_datetime, data[f"MA_{parameter}_month"], label=f'{parameter} Monthly MA')
    ax.set_title(f'Bancroft {parameter}')
    ax.set_ylabel(unit)
    # Set the x-axis major locator to YearLocator and the formatter to the year
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    # Manually add the mid-year tick for July 1st
    mid_years = [mdates.date2num(datetime.datetime(year, 7, 1)) for year in range(data.run_datetime.dt.year.min(), data.run_datetime.dt.year.max() + 1)]
    ax.set_xticks(list(ax.get_xticks()) + mid_years, minor=True)
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('/%m'))
    # Rotate the labels for better visibility
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=90, ha='right')
def create_subplots(data):
    fig, axes = plt.subplots(4, 3, figsize=[20, 15])
    for i, (parameter, unit) in enumerate(parameters):
        plot_weather_parameters(axes[i // 3, i % 3], data, parameter, unit)
    plt.suptitle('Bancroft weather parameters time series', fontsize=24)
    plt.tight_layout()
    # Update the legend to handle multiple lines
    handles, labels = axes[0, 0].get_legend_handles_labels()
    new_labels = [label.replace("avg_temp", "Average Temp.").replace("MA", "Moving Average") for label in labels]
    fig.legend(handles, new_labels, loc='upper right', ncol=3)
    # Save before plt.show(): show() releases the figure, so saving afterwards writes an empty image
    plt.savefig('../output/Bancroft weather parameters time series.png')
    plt.show()

create_subplots(df_full)
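The calculate_moving_averages helper above relies on pandas' rolling mean with min_periods=1, which lets the first observations use shorter windows instead of producing NaN. A minimal sketch of that behaviour on synthetic values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window of 3, but min_periods=1 lets the first values use shorter windows
ma = s.rolling(3, min_periods=1).mean()

print(ma.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```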
In these plots, the blue line represents the actual data points, capturing the daily variability. The orange line indicates the weekly moving average, smoothing short-term fluctuations and making it easier to observe the weekly trends. The green line shows the monthly moving average, providing a clear view of the longer-term trends and seasonal patterns. The differences in color help distinguish the immediate data from its broader trends and cycles. Weekly Moving Averages (orange line), smooth out short-term fluctuations and make weekly trends easier to see. They still follow the daily data points (blue line) closely, but with fewer peaks, which is a reduction in daily variability. Monthly Moving Averages (green line) further smooth the curve and highlight longer-term trends and seasonal patterns. These are useful for understanding the general direction of temperature data over the year. They fluctuate less than the weekly MAs and can better show the summer peaks and winter troughs of the years.
average temperature, minimal wet bulb temperature, average dewpoint, average temperature change The first four diagrams illustrate temperature-related parameters for Bancroft: average temperature (avg_temp), minimal wet bulb temperature (min_wet_bulb_temp), average dewpoint (avg_dewpoint) and average temperature change (avg_temp_change), all measured in Kelvin. Each plot reveals a clear seasonal pattern, with one oscillation corresponding to one year, indicating a repetitive cycle of temperature variation across the seasons. The first three plots show a sinusoidal-like pattern that repeats annually, with peaks representing the warmer summer months and troughs corresponding to the colder winter months. The coldest months appear to be those around New Year, while the warmest months are July and August. This suggests a strong seasonal influence on temperatures, as expected in most climatic regions. The weekly and monthly MAs in all three plots show the same seasonal pattern, with the highest values in summer and the lowest in winter, and their peaks and troughs align closely. This indicates a strong correlation between these temperature parameters and suggests that they are influenced by the same seasonal variables and possibly by each other.
The average temperature change (avg_temp_change) represents the hourly temperature change in Kelvin, including its direction (sign). This plot shows a similarly repetitive cycle, indicating a correlation with the other temperature variables. Despite the daily fluctuations, the general pattern of temperature changes remains stable over the year and is in line with the seasonal fluctuations of the other temperature parameters. Interestingly, the values are highest in the cold winter months and fall in the warm months, which suggests an inverse correlation with average temperature, minimal wet bulb temperature and average dewpoint.
Average Wind Speed and Maximum Wind Gusts The average wind speed (avg_windspd) and maximum wind gust (max_windgust) are quantified in meters per second. Both plots show substantial day-to-day variability with many peaks, which may correspond to transient weather events such as storms or frontal systems. The monthly moving averages in both wind plots do not show a clear seasonal trend, suggesting that wind speeds in Bancroft do not follow a simple seasonal pattern. Instead, they fluctuate throughout the year, possibly due to the region's complex weather dynamics. It is not possible to determine a trend, or recurring peaks and troughs, across the years. However, a strong correlation between the average wind speeds and maximum wind gusts can be seen: days with higher average wind speeds tend to also experience more intense wind gusts, as shown by the concurrent peaks across both plots. While the average wind speed plot already seems volatile, the maximum wind gust plot reveals an even higher degree of volatility, with some gusts reaching well above the average range. This highlights the possibly unpredictable nature of wind attributes, which can pose challenges for weather prediction and risk management.
Average Wind Direction (avg_winddir) Analysis Complementing the analysis of average wind speed and maximum wind gusts, the average wind direction offers insights into the directional aspect of wind patterns. The average wind direction is measured in degrees, indicating the direction from which the wind originates. The plot shows high day-to-day variability. There is no clear seasonal pattern in wind direction observable as well, suggesting that, like wind speeds, the direction for the region Bancroft is influenced by complex weather dynamics rather than a simple seasonal pattern. This might be attributed to the region's geography or the interaction between local and larger-scale climate patterns.
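This circular nature of wind direction is why the dataframe also keeps the avg_winddir_sin and avg_winddir_cos columns: encoding an angle as its sine and cosine makes nearly identical directions (such as 359° and 0°) numerically close, whereas the raw degree values are far apart. A minimal sketch of the encoding, assuming degrees as in avg_winddir:

```python
import numpy as np

deg = np.array([0.0, 90.0, 180.0, 359.0])
rad = np.radians(deg)
sin_enc, cos_enc = np.sin(rad), np.cos(rad)

# 359 degrees and 0 degrees are almost the same direction: the raw values
# differ by 359, but their (sin, cos) encodings are nearly identical
dist = np.hypot(sin_enc[3] - sin_enc[0], cos_enc[3] - cos_enc[0])
print(round(float(dist), 3))  # 0.017
```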
When looking at the diagrams for the average wind direction, the average wind speed and the maximum wind gusts, the high volatility of all three variables is striking. The average wind direction as a vector variable shows an even greater range of fluctuation, as each value represents a different actual direction and therefore no clear "average value" can be read off. This inherent volatility makes it difficult to visually determine whether there is a direct correlation between the variables. The high fluctuation in the data makes it difficult to identify high and low points. Although it seems possible that certain wind directions could correspond to stronger or weaker wind speeds and gusts - due to geographical or meteorological factors - a visual analysis alone is not sufficient to confirm a correlation. In order to be able to make a reliable statement about the existence of a correlation, further statistical tests are required. These could include cross-correlation analyses, which show how changes in one variable are related to changes in another over time.
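Such a lagged-correlation check can be sketched with pandas' shift and corr. The data here are synthetic (one series constructed to lag the other by two hours), only to illustrate how the lag with the strongest correlation can be identified:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=500))
y = x.shift(2) + 0.1 * pd.Series(rng.normal(size=500))  # y lags x by 2 steps

# Correlation of y with x shifted by each candidate lag (NaNs are skipped pairwise)
corrs = {lag: y.corr(x.shift(lag)) for lag in range(5)}
best = max(corrs, key=lambda k: corrs[k])
print(best)  # 2
```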
Bancroft Maximum Cumulative Snow, Snow Density and Ice Formation Observing the maximum cumulative snow (in millimeters), snow density (in kilograms per cubic meter) and ice formation plot (in millimeters), it is noticeable that these plots display a pronounced correlation with seasonal cycles with peaks typically in the winter months. The blue line captures individual snow events, with the single-day variations giving way to a smoothed seasonal trend depicted by the weekly (orange) and monthly (green) moving averages. Notably, these plots show an inverse correlation to the temperature related graphs. The denser snow readings coincide with higher snowfall accumulations, suggesting that subsequent snowfalls contribute to the compaction and increase in snow density. The moving averages, particularly the green lines, underscore the annual rhythm of snowfall and its gradual compaction into denser snowpacks.
Bancroft Average Pressure Change The Average Pressure Change plot, illustrating the average pressure change (in Pascals), exhibits seasonal variation. There is a clear correlation with the temperature and snow variables. The plot looks almost identical to the average temperature change plot, which is an indicator of a high correlation; a large pressure change could thus accompany a large temperature change. The moving averages show a smoothing effect and highlight the strong seasonal pattern. This suggests that pressure changes in Bancroft are likely driven by the same seasonal influences that govern temperature, snow, and ice, while transient and more complex weather systems could have a great influence as well. It is also apparent that the data for average pressure change is missing for the year 2022; for analyses including this parameter, the data for 2022 needs to be dropped.
Insights: Understanding the Meteorological Dynamics of Bancroft
Seasonal Temperature Patterns: A distinct annual cyclicity in temperature parameters is evident, with higher values in the summer and lower in the winter. The moving averages, both weekly and monthly, effectively smooth out daily variances, reinforcing the observed seasonal trends.
Wind Characteristics: Wind speed and gusts display high day-to-day variability, lacking clear seasonal trends. This suggests that Bancroft's wind patterns are subject to complex and multifaceted weather dynamics rather than straightforward seasonal influences.
Variability in Wind Direction: Similar to wind speed, wind direction shows high variability with no discernible seasonal trend, indicating that local geographical and meteorological complexities significantly influence wind patterns.
Snowfall Patterns and Density: The snowfall data show a pronounced seasonal pattern, inversely correlated with temperature data. Peaks during the winter months align with the temperature lows, and the density of snow appears to increase with subsequent snowfalls, indicating compaction over time.
Pressure Changes: The average pressure change closely mirrors the temperature change plot, suggesting a strong relationship between atmospheric pressure and temperature, with seasonal factors being a significant driver of these variations.
It is important to keep in mind that the visualizations may not be the optimal method for identifying all types of correlations or patterns within the data, and visual analysis alone might not suffice to confirm a suspected correlation. Wind direction and wind speed are a good example, due to the inherent volatility and the vector nature of wind direction data. Advanced statistical tests, like cross-correlation analysis, could be necessary to ascertain the relationships between variables more accurately.
Moving from the broad overview provided by the pie charts, the specifics are now zoomed in on with histograms. These shed light on the distribution and frequency of the weather parameters, offering a deeper understanding that is crucial for forecasting extreme weather conditions. The next visualization shows the corresponding histograms, which provide a quantitative view of the distribution of the weather data.
def plot_histogram(data, parameter, unit, ax):
    # Plot the histogram
    sns.histplot(data[parameter], bins=30, kde=True, ax=ax, color='blue')
    # Calculate the median of the parameter
    median_value = data[parameter].median()
    # Draw a vertical line for the median value
    ax.axvline(median_value, color='red', linestyle='dashed', linewidth=2)
    # Add a label for the line
    ax.text(median_value, ax.get_ylim()[1] * 0.9, f'Median: {median_value:.2f}', color='red')
    # Set title and axis labels
    ax.set_title(f'Histogram for {parameter}')
    ax.set_ylabel('Frequency')
    ax.set_xlabel(unit)

# Set up the subplots with shared y-axis
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=[20, 15])

# Plot histograms for all parameters
for i, (parameter, unit) in enumerate(parameters):
    row, col = divmod(i, 3)
    plot_histogram(df, parameter, unit, axes[row, col])

# Manually set the bottom margin to keep the x-axis labels visible (the value must be positive)
plt.subplots_adjust(bottom=0.1)

# Show the plot
plt.suptitle('Bancroft weather parameters histograms', fontsize=24)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjusted so the title and plots render correctly
# Save before plt.show(): show() releases the figure, so saving afterwards writes an empty image
plt.savefig('../output/Bancroft weather parameters histograms.png')
plt.show()
Average Temperature (avg_temp): Histogram: Shows a left-skewed, bimodal distribution. This type of distribution suggests two predominant temperature states, often indicative of seasonal variations as seen with changing weather conditions over the year.
Peak: The plot shows a bimodal distribution, with a distinct peak near 295 Kelvin, indicating a frequently occurring temperature in the dataset, possibly representing a typical average temperature in summer. Another peak is around 275 Kelvin, possibly representing a typical average temperature in winter.
Median: The median value of 279.88 Kelvin is an indicator of the central tendency of the temperature distribution and is lower than the peak, reflecting the left-skewed distribution.
Skewness: The skew is only slight: although the left-skewed shape means a longer tail toward low temperatures, the data are not heavily concentrated at the lower temperatures.
Dispersion: The seemingly moderate dispersion around the two peaks suggests that the temperatures in these areas are relatively consistent. The bimodal nature of the distribution shows that there are two distinct areas where temperatures most frequently occur. Observations deviating strongly from those peaks might indicate observations with extreme weather events.
Comparison to Time Series: The time series comparison confirms the seasonal patterns observed in the histogram. The data show a recurring tendency towards the same temperatures at certain times of the year, typical for climatic changes through the seasons.
Minimum Wet Bulb Temperature (min_wet_bulb_temp) and Average Dew Point (avg_dewpoint): Histogram: A left-skewed bimodal distribution, similar to that of the average temperature. The bimodality might indicate two dominant states likely reflecting seasonal fluctuations.
Peaks: The peaks lie between 285 and 290 Kelvin, indicating a common range for both variables, possibly an indicator of certain climatic conditions or seasons.
Median: The median values are approximately 274.25 Kelvin for the minimum wet bulb temperature and about 275.43 Kelvin for the average dew point. These values indicate a tendency towards cooler conditions in the data collection.
Dispersion: The moderate dispersions around the peaks suggest some consistency in climatic conditions. There's a clear seasonal dependence, as seen by the regular oscillation of values in the time series.
Comparison to Time Series: The time series plots confirm the bimodal peaks and the seasonal nature of the histograms. There is a clear seasonal oscillation, with peaks in summer and lower values in winter.
Average Temperature Change (avg_temp_change): Histogram: The distribution is symmetrically centered around the zero point, suggesting a bell shape. This implies no significant skewness and that daily temperature changes might be normally distributed.
Peak: The peak at 0 Kelvin suggests that no change in temperature is the most common state. This could point to a stable climate where extreme temperature fluctuations from day to day are not common.
Median: The median at 0.01 Kelvin is nearly zero, confirming the peak. The median reflects the central tendency in a symmetric distribution, here highlighting the stability of temperature changes.
Comparison to Time Series: The time series analysis shows that large daily temperature fluctuations are rare, confirmed by the concentration of data points around the null line in the histogram. The time series plots also show that most temperature changes are minor and oscillate around this null point.
Dispersion: Due to the symmetry of the distribution and the concentration of data near zero, the dispersion appears to be low. This means there are no large unexpected temperature changes, and the temperature remains quite stable on average.
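The symmetry claims for avg_temp_change can be quantified directly: a distribution centred on zero should have a median and skewness both near zero. A small sketch on synthetic data (the column name `avg_temp_change` mirrors the notebook; the values are not the real series):

```python
import numpy as np
import pandas as pd

# Synthetic, approximately normal daily temperature changes centred on zero.
rng = np.random.default_rng(42)
changes = pd.Series(rng.normal(0, 0.3, 10_000), name="avg_temp_change")

# Median ~0 and skewness ~0 back up the "symmetric bell shape" reading.
print(round(changes.median(), 2), round(changes.skew(), 2))
```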
Average Wind Speed (avg_windspd) and Maximum Wind Gust (max_windgust): Histograms: Both histograms exhibit a right-skewed, unimodal distribution. This shape of distribution indicates that lower speed values occur more frequently than high ones.
Peak: The most common value (mode) for average wind speed is around 2 m/s, while that for maximum wind gusts is about 12.5 m/s. These values represent the most frequently occurring conditions for each of the two measurements.
Median: The median for average wind speed is 2.35 m/s and for maximum wind gusts 12.27 m/s. For the average wind speed, the median lies above the mode, as is typical of a right-skewed distribution, indicating that lower speeds make up the bulk of the data.
Skewness: The right skew of both distributions confirms that lower wind speeds and gusts are more common, and higher speeds occur less frequently.
Comparison to Time Series: The time series data show that high wind speeds and gusts occur sporadically and are not a constant state. There are peaks representing high wind events corresponding with the rarer higher values in the right-skewed histogram.
Dispersion: There appears to be a broad dispersion for both parameters, particularly for maximum wind gusts, which occasionally reach peak values. This shows that while low wind speeds are the norm, there is still a considerable number of occasions where the wind becomes strong enough to be classified as a gust.
Maximum Cumulative Precipitation (max_cumulative_precip): Histogram: The histogram shows a strongly right-skewed, unimodal distribution with a pronounced long tail extending to higher values. This means that high precipitation amounts do occur but are much less common than low amounts.
Peak: The mode is near zero, indicating that most observed periods have either no or very little precipitation.
Median: The median of 1.53 mm indicates that more than half of the precipitation events fall below this value. The median is lower than the mean due to the right skew.
Skewness: The strong right skew of the histogram confirms that low precipitation amounts are frequent and high amounts rare.
Comparison to Time Series: The time series shows that while high precipitation events occur irregularly, when they do occur, they can be substantial. This corresponds with the long tail of the histogram, indicating rare but extreme precipitation events.
Dispersion: The wide dispersion in the histogram reflects the variability in precipitation events, from very dry periods to rare heavy rainfalls.
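The long-tail reading above suggests a simple tail analysis: flag observations above a high quantile as heavy-precipitation events. A hedged sketch on synthetic, strongly right-skewed data (exponential draws standing in for the real `max_cumulative_precip` column):

```python
import numpy as np
import pandas as pd

# Exponential draws mimic the right-skewed shape described in the histogram.
rng = np.random.default_rng(7)
precip = pd.Series(rng.exponential(scale=3.0, size=5_000), name="max_cumulative_precip")

# Everything above the 95th percentile counts as a "heavy" event here.
threshold = precip.quantile(0.95)
heavy = precip[precip > threshold]
print(f"threshold={threshold:.1f} mm, heavy days={len(heavy)} ({len(heavy)/len(precip):.1%})")
```

The same quantile cut applied to the real series would give a data-driven definition of "rare but extreme" events, and `precip.median() < precip.mean()` confirms the right skew numerically.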
Maximum Cumulative Snow (max_cumulative_snow) and Maximum Cumulative Ice (max_cumulative_ice): Histograms: The histograms for both variables show an extremely right-skewed, unimodal distribution, indicating a concentration of values near zero and that high values are very rare.
Peaks: Both histograms have their peaks near zero, indicating that most periods have either no snow or ice, or only very small amounts.
Median: A median of 0 mm for both variables indicates that the majority of observed events accumulate no snow or ice, highlighting the rarity of events with noticeable accumulation.
Skewness: The strong right skew in both histograms shows that while most time periods have no or minimal accumulation, the rare events with snow or ice formation can lead to extremely high values.
Comparison to Time Series: The time series plots confirm that high accumulations of snow and ice are restricted to a few events, typically occurring in the winter months. The moving averages (MAs) from the time series indicate that these high values do not persist and are generally short-lived events.
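The winter-only pattern claimed above can be verified by counting days with nonzero accumulation per calendar month. A minimal sketch with the notebook's column names (`run_datetime`, `max_cumulative_snow`) on a synthetic series where, by construction, snow falls only November through March:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2015-01-01", "2022-12-31", freq="D")
# Toy assumption: snow is only possible Nov-Mar.
is_winter = np.isin(dates.month, [11, 12, 1, 2, 3])
snow = np.where(is_winter, rng.exponential(5.0, len(dates)), 0.0)
df_demo = pd.DataFrame({"run_datetime": dates, "max_cumulative_snow": snow})

# Count snow days per calendar month; nonzero counts mark the snow season.
snow_days = (df_demo["max_cumulative_snow"] > 0).groupby(df_demo["run_datetime"].dt.month).sum()
print(snow_days[snow_days > 0].index.tolist())  # months with any snow days
```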
Maximum Snow Density (max_snow_density_6): Histogram: The histogram shows that nearly all measured values are near zero, indicating that high snow densities are very rare. The few outliers confirm the presence of occasional significant snow density events.
Median: A median of 0 kg/m^3 indicates that the vast majority of observations have no or very low snow density, underscoring the rarity of events with higher snow density.
Skewness: An extremely right-skewed distribution indicates that while most time periods have no or low snow density, the rare events with high snow density can lead to extremely high values.
Average Pressure Change (avg_pressure_change): Dispersion: The dispersion around the zero point appears to be low, indicating that extreme pressure changes are unusual and that atmospheric pressure in this region remains relatively stable.
Comparison to Time Series: The time series plots show little variation in air pressure over time, which is confirmed by the histogram. Most changes are small and there are no long-term trends of increasing or decreasing pressure.
The climate in Bancroft is characterized by a bimodal and seasonal temperature distribution, which indicates significant weather changes throughout the year. The temperature shows no extreme daily fluctuations. Wind speeds are unimodally distributed, with low speeds far more common than high gusts. Precipitation events are mostly low, but there are occasional intense rainfalls. Snow and ice accumulations are rare, but occur sporadically in significant quantities. Atmospheric pressure remains relatively stable, with little long-term fluctuation. Overall, the analysis indicates a climate that is dominated by seasonal patterns and is generally moderate and stable, with occasional extremes such as snow, ice, cold, and high winds, as is to be expected for a Canadian climate.
The frequencies of variable values are presented in the following dot plots (strip plots). This allows for a quick identification of which variables occur particularly rarely or frequently at a given value.
def plot_dotplot(data, parameter, unit, ax):
    sns.stripplot(x=data[parameter], ax=ax, color='blue', size=8, jitter=True)
    ax.set_title(f'Dot Plot for {parameter}')
    ax.set_ylabel('Frequency')
    ax.set_xlabel(unit)

# Set up the subplots
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=[20, 15])
# Plot dot plots for all parameters
for i, (parameter, unit) in enumerate(parameters):
    row, col = divmod(i, 3)
    plot_dotplot(df, parameter, unit, axes[row, col])
# Manually set bottom margin to make x-axis visible
plt.subplots_adjust(bottom=-0.1)
plt.suptitle('Bancroft weather parameters dot plots', fontsize=24)
plt.tight_layout()
# Save before plt.show(): showing closes the figure, so saving afterwards
# writes an empty canvas
plt.savefig('../output/Bancroft weather parameters dot plots.png')
plt.show()
The dot plots for the weather parameters of Bancroft provide a detailed view of data frequency across various metrics. These diagrams are useful for quickly identifying rarer and more frequent events and offer an additional dimension to the understanding of histogram and time series analyses.
Average Temperature (avg_temp): The points are widely scattered across a range of values, indicating a variety of temperatures throughout the year. This corresponds with the bimodal patterns in the histogram and seasonality in the time series.
Minimum Wet Bulb Temperature (min_wet_bulb_temp) and Average Dew Point (avg_dewpoint): The point distributions are similar to average temperature and confirm the bimodal and seasonal fluctuations.
Average Temperature Change (avg_temp_change): The even distribution of points around the zero point indicates that significant daily temperature changes are rare, consistent with the bell-shaped histograms.
Average Wind Speed (avg_windspd) and Maximum Wind Gust (max_windgust): The points concentrate at lower speeds, reflecting the right skew observed in the histogram and the rare peaks in the time series for higher speeds.
Maximum Cumulative Precipitation (max_cumulative_precip): The concentration of points near zero confirms the right skew observed in the histogram and the sporadic peaks in the time series plots for high precipitation events.
Maximum Cumulative Snow (max_cumulative_snow) and Ice Accumulation (max_cumulative_ice): The points are predominantly concentrated at the lower end of the scale, highlighting the rarity of events with higher accumulations, corresponding with the extremely right-skewed histograms.
Maximum Snow Density (max_snow_density_6): The distribution of points near zero suggests that high snow densities are rare, as indicated in the histogram and time series data.
Average Pressure Change (avg_pressure_change): The points are evenly distributed around the zero value, indicating a generally stable pressure environment, which is also confirmed by the histogram and time series plots.
# Set up the subplots
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(20, 15), constrained_layout=True)
# List of relevant weather parameters
relevant_parameters = ['avg_temp', 'min_wet_bulb_temp', 'avg_dewpoint', 'avg_temp_change',
                       'avg_windspd', 'max_windgust', 'avg_winddir', 'max_cumulative_precip',
                       'max_snow_density_6', 'max_cumulative_snow', 'max_cumulative_ice',
                       'avg_pressure_change']
sns.set(style="darkgrid")
# Loop through each parameter and create a boxplot
for i, parameter in enumerate(relevant_parameters):
    ax = axes.flatten()[i]
    sns.boxplot(x=df[parameter], ax=ax, color='blue')
    ax.set_title(f'Box Plot for {parameter}')
    ax.set_ylabel(parameter)
# constrained_layout already handles spacing, so tight_layout is unnecessary
plt.suptitle('Bancroft weather parameters box plots', fontsize=24)
# Save before plt.show() so the rendered figure (not an empty one) is written
plt.savefig('../output/Bancroft weather parameters box plots.png')
plt.show()
Average Temperature (avg_temp): The boxplots display a wide distribution with a central tendency around the median, confirming the seasonal fluctuations already visible in the histograms and time series.
Minimum Wet Bulb Temperature (min_wet_bulb_temp) and Average Dewpoint (avg_dewpoint): The boxplots suggest a similar distribution, with a notable range and central tendency, consistent with the seasonal patterns in the other diagrams.
Average Temperature Change (avg_temp_change): The boxplots indicate that most changes are close to zero, suggesting low daily variability, as also seen in the Dot Plots and time series.
Average Wind Speed (avg_windspd) and Maximum Wind Gust (max_windgust): The boxplots show that wind speeds are mostly in the lower range, with some outliers representing the rare higher speeds, as demonstrated in the Dot Plots and histograms.
Maximum Cumulative Precipitation (max_cumulative_precip): The boxplot reveals a concentration of data near zero with some high outliers, confirming the sporadic heavy rains visible in the histograms and time series.
Maximum Snow Density (max_snow_density_6), Maximum Cumulative Snowfall (max_cumulative_snow), Maximum Cumulative Ice Accumulation (max_cumulative_ice): The boxplots illustrate that high values are rare, with many data points clustering near zero, indicating generally low snow and ice accumulation, similar to observations in the Dot Plots and histograms.
Average Pressure Change (avg_pressure_change): The boxplots show a distribution centered around zero with some outliers on both sides, indicating a generally stable atmospheric pressure distribution, consistent with the other types of diagrams.
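The outliers visible in these boxplots follow matplotlib's default whisker rule: points beyond Q1 - 1.5·IQR or Q3 + 1.5·IQR. A minimal reimplementation of that rule on synthetic, gust-like data (the column name `max_windgust` mirrors the notebook; gamma draws stand in for the real right-skewed series):

```python
import numpy as np
import pandas as pd

# Gamma draws give a right-skewed shape similar to the wind-gust histogram.
rng = np.random.default_rng(3)
gusts = pd.Series(rng.gamma(shape=4.0, scale=3.0, size=2_000), name="max_windgust")

# Tukey's fences: the same rule matplotlib uses for boxplot fliers.
q1, q3 = gusts.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = gusts[(gusts < q1 - 1.5 * iqr) | (gusts > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outliers beyond the whiskers")
```

Running the same three lines on each real column would turn the visual "some high outliers" readings above into concrete counts.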
In the next step the variables are visualized together with their corresponding weather events.
def plot_wep_features_distogram(df, label_col):
    f, axes = plt.subplots(4, 3, constrained_layout=True, figsize=[20, 15])
    sns.set(style="darkgrid")
    axes = axes.ravel()
    for i in df[label_col].unique():
        df_cluster_distogram = df.loc[df[label_col] == i]
        num_label = df_cluster_distogram[label_col].value_counts()[i]
        for j, feature in enumerate(relevant_parameters):
            # sns.distplot is deprecated (removed in seaborn 0.14);
            # histplot with a KDE overlay is the modern replacement
            sns.histplot(df_cluster_distogram[feature], kde=True, stat='density',
                         label=f'{i}: {num_label}', ax=axes[j])
    f.suptitle("Bancroft Distribution of Weather Features by Weather Event Profiles in Distograms", fontsize=24)
    axes[0].set(title="Avg temperature in Kelvin", xlim=(230, 310), xlabel="")
    h, l = axes[0].get_legend_handles_labels()
    axes[1].set(title="Min wet bulb temperature in Kelvin", xlim=(230, 310), xlabel="")
    axes[2].set(title="Avg dewpoint in Kelvin", xlim=(230, 310), xlabel="")
    axes[3].set(title="Avg temperature change in Kelvin", xlim=(-1, 1), xlabel="")
    axes[4].set(title="Avg wind speed in m/s", xlim=(0, 10), xlabel="")
    axes[5].set(title="Max windgust in m/s", xlim=(0, 30), xlabel="")
    axes[6].set(title="Avg winddirection in degrees", xlim=(0, 360), xlabel="")
    axes[7].set(title="Max cumulative precipitation in mm", xlim=(0, 100), ylim=(0, 0.1), xlabel="")
    axes[8].set(title="Max snow density (when snow exceeds 6.35mm) in kg/m^3", xlim=(0, 800), ylim=(0, 0.01), xlabel="")
    axes[9].set(title="Max cumulative snow accretion in mm", xlim=(0, 1000), ylim=(0, 0.009), xlabel="")
    axes[10].set(title="Max cumulative ice accretion in mm", xlim=(0, 10), ylim=(0, 0.8), xlabel="")
    axes[11].set(title="Avg pressure change in Pa", xlim=(-200, 200), xlabel="")
    f.legend(h, l, loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=5)

data = df.dropna(subset=['wep'], how='any').sort_values('wep')
plot_wep_features_distogram(data, 'wep')
plt.savefig('../output/Bancroft_wep_parameter_distograms.png')
Overview These distograms provide a statistical representation of weather data, illustrating the density and distribution of meteorological features categorized by WEPs. Each plot represents a different weather feature, and the overlay of kernel density estimation curves facilitates the visualization of data distribution trends.
Analysis of Distogram Features Temperature Metrics (Average, Minimum Wet Bulb, Dewpoint in Kelvin): These metrics show multimodal distributions, with peaks representing common temperature ranges. The presence of multiple peaks can indicate the occurrence of distinct weather patterns or seasonal variations.
Temperature Change (Average Change in Kelvin): The spread of data around zero indicates frequent temperature stability, but noticeable deviations suggest periods of significant warming or cooling. Skewness in this distogram would indicate a tendency towards one or the other.
Wind Characteristics (Average Speed, Maximum Gust in m/s): Wind speed distograms are skewed towards lower speeds, with long tails extending to higher speeds, indicating that while calm conditions are common, gusty conditions do arise, albeit less frequently.
Wind Direction (Average Direction in Degrees): The relatively flat distribution across degrees indicates no dominant wind direction, suggesting variable wind patterns across the different WEPs.
Precipitation (Maximum Cumulative in mm): The distribution here is right-skewed, indicating that lower precipitation events are more common, with fewer but significant heavy precipitation events.
Snow Metrics (Maximum Density, Cumulative Snow Accretion in mm): These plots reveal a right-skew, especially for snow density, suggesting that while light snowfall is common, events with dense snow accumulation are rare but significant when they occur.
Ice Accumulation (Maximum Cumulative in mm): A similar right-skewed distribution to snow, indicating that ice accumulation is generally low, with occasional extreme icing events.
Pressure Variability (Average Change in Pa): The distribution of pressure changes appears to be approximately normally distributed, indicating that significant pressure drops or spikes are unusual.
Interpretation and Real-World Implications Temperature Metrics: The various modes in temperature distograms may correspond to different seasonal conditions, with the specific peaks corresponding to typical temperatures for those times of year.
Temperature and Pressure Changes: Large swings in temperature and pressure can be associated with frontal systems, which are critical for weather forecasting and can have significant implications for agriculture and energy usage.
Wind Patterns: Understanding the distribution of wind speed and direction is crucial for aviation, marine navigation, and the operation of wind farms.
Precipitation and Snow/Ice Accumulation: These metrics are vital for water resource management, agriculture, and urban planning, especially in anticipating flood risks or the impact of heavy snowfall on infrastructure.
Skewness and Tail Analysis: The presence of skewness and heavy tails in the distograms for precipitation, snow, and ice accumulation suggests the potential for extreme weather events, which are critical for disaster preparedness and risk management.
Concluding Remarks The distograms offer a nuanced view of the probability density of various weather conditions, providing valuable insights for meteorologists, environmental scientists, and policy makers. The observed multimodality, skewness, and the presence of heavy tails in the distributions underline the complexity of weather dynamics and the importance of preparing for a range of weather conditions, from the most common to the extreme. Quantitative analysis, such as calculating skewness, kurtosis, and conducting hypothesis tests, would further enhance the understanding and interpretation of these weather patterns.
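The quantitative follow-up suggested above (skewness, kurtosis, and a hypothesis test) can be sketched with scipy. Here it runs on synthetic stand-ins rather than the real columns: a Gaussian sample shaped like the described pressure-change distribution and an exponential sample shaped like the right-skewed precipitation distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
pressure_like = rng.normal(0, 40, 5_000)   # symmetric, roughly normal
precip_like = rng.exponential(3.0, 5_000)  # strongly right-skewed

skew_norm = stats.skew(pressure_like)
kurt_norm = stats.kurtosis(pressure_like)  # excess kurtosis; ~0 for a normal
_, p_norm = stats.normaltest(pressure_like)  # D'Agostino-Pearson test
skew_exp = stats.skew(precip_like)
_, p_exp = stats.normaltest(precip_like)

print(f"pressure-like: skew={skew_norm:.2f}, kurtosis={kurt_norm:.2f}, p={p_norm:.3g}")
print(f"precip-like:   skew={skew_exp:.2f}, p={p_exp:.3g}")
```

Substituting the real dataframe columns for the synthetic arrays would make the "multimodality, skewness, and heavy tails" observations testable rather than visual.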
The plots show that some variable values occur more often during particular weather events. For example, days with high temperatures often have blue sky, while snowstorms with high precipitation tend to occur at low temperatures. However, it is important to note that no statements about causality can be made at this point; the graphic only reveals correlations.
Here, the frequencies of various variables are once again depicted. The points are colored according to weather events. This allows for conclusions to be drawn regarding the occurrence of specific aspects during different weather events.
def plot_wep_features_dotplot(df, label_col, relevant_parameters):
    unique_labels = df[label_col].unique()
    palette = sns.color_palette()
    label_colors = dict(zip(unique_labels, palette))
    f, axes = plt.subplots(4, 3, constrained_layout=True, figsize=[20, 15])
    sns.set(style="darkgrid")
    axes = axes.ravel()
    for i in unique_labels:
        df_cluster_dotplot = df.loc[df[label_col] == i]
        for j, feature in enumerate(relevant_parameters):
            sns.stripplot(x=df_cluster_dotplot[feature], color=label_colors[i], ax=axes[j])
    f.suptitle("Bancroft Distribution of Weather Features by Weather Event Profiles in Dotplots", fontsize=24)
    # Build the legend from proxy artists; stripplot itself attaches no labels,
    # so calling axes[0].legend() directly would only raise a warning
    patches = [plt.Line2D([0], [0], color=label_colors[label], marker='o', linestyle='', markersize=10) for label in unique_labels]
    f.legend(patches, unique_labels, loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=len(unique_labels))
    axes[0].set(title="Avg temperature in Kelvin", xlim=(230, 310), xlabel="")
    axes[1].set(title="Min wet bulb temperature in Kelvin", xlim=(230, 310), xlabel="")
    axes[2].set(title="Avg dewpoint in Kelvin", xlim=(230, 310), xlabel="")
    axes[3].set(title="Avg temperature change in Kelvin", xlim=(-1, 1), xlabel="")
    axes[4].set(title="Avg wind speed in m/s", xlim=(0, 10), xlabel="")
    axes[5].set(title="Max windgust in m/s", xlim=(0, 30), xlabel="")
    axes[6].set(title="Avg winddirection in degrees", xlim=(0, 360), xlabel="")
    axes[7].set(title="Max cumulative precipitation in mm", xlim=(0, 100), ylim=(0, 0.1), xlabel="")
    axes[8].set(title="Max snow density (when snow exceeds 6.35mm) in kg/m^3", xlim=(0, 800), ylim=(0, 0.01), xlabel="")
    axes[9].set(title="Max cumulative snow accretion in mm", xlim=(0, 1000), ylim=(0, 0.009), xlabel="")
    axes[10].set(title="Max cumulative ice accretion in mm", xlim=(0, 10), ylim=(0, 0.8), xlabel="")
    axes[11].set(title="Avg pressure change in Pa", xlim=(-200, 200), xlabel="")

data = df.dropna(subset=['wep'], how='any').sort_values('wep')
plot_wep_features_dotplot(data, 'wep', relevant_parameters)
plt.savefig('../output/Bancroft_wep_parameter_dotplots.png')
Average Temperature: The distribution clearly shows that days with clear skies (blue) tend to have higher temperatures, while snow days (green, red, brown) usually fall in the cooler temperature range. Snowstorms with heavy precipitation also tend to result in lower temperatures, while storms with freezing rain/heavy snow and ice storms tend to have more moderate temperatures. Furthermore, the distribution shows that clear-sky days can also have the lowest temperatures, an initial indication that the weather event type is only partially determined by temperature.
Minimum Wet Bulb Temperature and Average Dew Point: These parameters follow a similar pattern to average temperature, with lower values often associated with snowfall (green, red, brown), and higher temperatures occurring more frequently on clear days (blue).
Average Temperature Change: The slight variation in temperature change across different weather events suggests a relatively stable thermal environment, regardless of weather conditions.
Average Wind Speed and Maximum Wind Gust: Lower wind speeds seem to occur frequently on clear days (blue), while higher speeds appear to be associated with moderate snowfall or more stormy conditions (gray, pink). Wind gusts show a wide dispersion across various weather events, making it difficult to draw conclusions.
Average Wind Direction: The wide dispersion across all directions appears to show no significant correlation with specific weather events, indicating diverse wind patterns.
Maximum Cumulative Precipitation: The lowest values are associated with moderate rain (purple), followed by moderate snowfall. Clear-sky days (blue) can reach the highest values, although such points become increasingly rare as the value increases; the very highest values are seen only on clear-sky days.
Maximum Snow Density: Logically, only snow events are considered (red, brown, green, gray, pink). Moderate snowfall accounts for the lowest values. High snow densities are rare and seem to correlate with snowstorms (gray, green), as can be seen from the isolated points in these colors.
Maximum Cumulative Snowfall and Maximum Cumulative Ice Accumulation: These parameters clearly show that days with heavy snowfall (gray) or freezing rain (orange) lead to higher accumulations, while clear days (blue) and light snowfall (red, brown) show little accumulation.
Average Pressure Change: The distribution of pressure changes shows no clear correlation to specific weather events, suggesting that pressure changes might be relatively stable regardless of the type of weather. However, it can be seen that continuous freezing rain (orange), mild snowfall (red), heavy snowfall with accumulated snow (green), and clear sky days (blue) tend to appear on the edges and thus occur in combination with a high average change in pressure.
def plot_wep_features_boxplot(df, label_col, relevant_parameters):
    unique_labels = df[label_col].unique()
    palette = sns.color_palette()
    label_colors = dict(zip(unique_labels, palette))
    f, axes = plt.subplots(4, 3, constrained_layout=True, figsize=[20, 15])
    sns.set(style="darkgrid")
    for j, feature in enumerate(relevant_parameters):
        ax = axes.flatten()[j]
        sns.boxplot(x=df[label_col], y=df[feature], ax=ax, palette=label_colors)
        ax.set_title(f'Box Plot for {feature}')
        ax.set_ylabel(feature)
        ax.set_xlabel('')
        ax.set_xticks([])
    f.suptitle("Bancroft Distribution of Weather Features by Weather Event Profiles in Boxplots", fontsize=24)
    handles = [plt.Line2D([0], [0], color=label_colors[label], marker='o', linestyle='', markersize=10) for label in unique_labels]
    f.legend(handles, unique_labels, loc='upper center', bbox_to_anchor=(0.5, -0.05), fancybox=True, shadow=True, ncol=len(unique_labels))

data = df.dropna(subset=['wep'], how='any').sort_values('wep')
plot_wep_features_boxplot(data, 'wep', relevant_parameters)
# Save before plt.show() so the rendered figure (not an empty one) is written
plt.savefig('../output/Bancroft_wep_parameter_boxplots.png')
plt.show()
Comprehensive Weather Analysis Through Boxplot Visualization
The suite of boxplots offers an extensive overview of weather features associated with different WEPs, organized into rows and columns, each depicting a distinct meteorological attribute. These boxplots reveal the extensive variability and presence of outliers across temperature, wind, and precipitation features, suggesting a spectrum of conditions from stable to extreme.
Temperature-Related Features
Temperature and Pressure Variations
Wind Features
Precipitation
Snow Density
Relationships Between Boxplots Complex interactions between various weather conditions are suggested, including potential inverse correlations between temperature and precipitation forms, associations between strong winds and temperature fluctuations, and correlations between higher wind speeds and intense precipitation events.
Insights
In the analysis of the present assignment, temperature and wind are of particular importance. These variables will be examined more closely.
# If 'Blue sky day' is present in 'wep', append 'label0' to it; otherwise, use 'wep'
df['display_label'] = df.apply(lambda row: f"{row['wep']}: '{row['label0']}'" if row['wep'] == 'Blue sky day' else row['wep'], axis=1)
# Count occurrences of weather events
weather_counts = df['display_label'].value_counts()
# Count occurrences of weather events excluding 'Blue sky day'
not_blue_sky_day_counts = df[df['wep'] != 'Blue sky day']['wep'].value_counts()
# Determine events to be exploded
explode_events = set(not_blue_sky_day_counts.index)
# Create explode list for the first chart
explode = [0.1 if event in explode_events else 0 for event in weather_counts.index]
# Total number of all samples
total_samples = len(df)
# Create the plot
plt.figure(figsize=(16, 8))
# Group categories and adjust labels
lower_half_threshold = weather_counts.quantile(0.5)
grouped_labels = [label if weather_counts[label] > lower_half_threshold else 'Other' for label in weather_counts.index]
# Pie chart for all weather events
plt.subplot(1, 2, 1)
patches, texts, autotexts = plt.pie(weather_counts, labels=grouped_labels, autopct=lambda p: f'{p:.1f}%\n({int(p/100*total_samples)})', startangle=90, explode=explode)
plt.title(f'Distribution of All Weather Events\nTotal Number of Observations: {total_samples}')
# Adjust text size and hide the overlapping 'Other' labels
for text, autotext in zip(texts, autotexts):
    if text.get_text() == 'Other':
        text.set_visible(False)
    text.set_fontsize(8)
    autotext.set_fontsize(8)
# Enlarged view for 'wep' != 'Blue sky day'
plt.subplot(1, 2, 2)
total_not_blue_samples = not_blue_sky_day_counts.sum()
plt.pie(not_blue_sky_day_counts, labels=not_blue_sky_day_counts.index, autopct=lambda p: f'{p:.1f}%\n({int(p/100*total_not_blue_samples)})', startangle=90)
plt.title(f'Distribution of Extreme Weather Events\nTotal Number of Non-Blue-Sky Observations: {total_not_blue_samples}')
# Save before plt.show() so the rendered figure (not an empty one) is written
plt.savefig('../output/Distribution of weather events pie chart.png')
plt.show()
The data shows a clear predominance of weather events, with 'Blue sky day' being the majority. However, there is a notable frequency of various forms of precipitation, particularly snowfall, which can have implications for activities such as transportation, agriculture, and city planning. The variance in snow-related conditions from mild to heavy suggests a range of winter weather that would require different levels of response from municipal services.
The first chart details the proportion of the different weather events across 65,345 observations. The largest single category, about 31.6% of observations, is "Blue sky day: 0". 'Blue sky day' occurrences are differentiated into five sub-categories ('0' to '4'), which differ greatly in frequency; together they account for roughly 80.5 percent of all observations, indicating that clear skies are the predominant condition. The remaining weather events, such as moderate rain, mild and moderate snowfall, and other less frequent events, are highlighted, since they form the main focus of this EDA.
The second chart focuses on the distribution of extreme weather events, excluding 'Blue sky day'. This subset consists of 12,712 observations, approximately 19.5% of the total dataset, indicating that extreme weather events are relatively uncommon. Within this subset, mild snowfall is the most common, making up 35.4% of non-clear-sky observations. Moderate snowfall (23.0%) and moderate rain (20.7%) are the next most common events. These three together account for 79.1 percent; the remaining 20.9 percent are the more extreme events: heavy snowfall with accumulated snow, continuous freezing rain (Frondurchlauf), storms with freezing rain / heavy snow- and ice storms, and snowstorms with high precipitation.
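The pie-chart shares can be reproduced directly from the event counts reported by `df['wep'].value_counts()` later in this notebook, restricted to the non-blue-sky events:

```python
import pandas as pd

# Counts of non-blue-sky weather events, taken from the notebook's value_counts output.
counts = pd.Series({
    "Mild snowfall": 4497,
    "Moderate snowfall": 2920,
    "Moderate rain": 2630,
    "Heavy snowfall with accumulated snow": 1040,
    "Frondurchlauf / Continuous freezing rain": 934,
    "Storm with freezing rain / Heavy snow- and icestorm": 430,
    "Snowstorm with high precipitation": 261,
})
total = counts.sum()
shares = (counts / total * 100).round(1)  # percentage share of each event
print(total)
print(shares)
```

This recovers the 12,712-observation subset and the 35.4% / 23.0% / 20.7% shares quoted in the text.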
df_datetime = df_full.copy()
# Group the data by year and month and calculate the monthly average in Celsius
grouped_monthly_avg_celsius = df_full.groupby([df_full['run_datetime'].dt.year, df_full['run_datetime'].dt.month])['avg_temp_celsius'].mean()
# Convert the grouped data into a format suitable for the radar chart
radar_data_celsius = grouped_monthly_avg_celsius.unstack(level=1)
# Select the colormap
cmap = plt.cm.coolwarm
# Determine the July temperatures and scale them for the colormap
july_temps = radar_data_celsius.loc[:, 7]  # 7 = July
min_july_temp = july_temps.min()
max_july_temp = july_temps.max()
scaled_july_temps = (july_temps - min_july_temp) / (max_july_temp - min_july_temp)
colors = [cmap(temp) for temp in scaled_july_temps]  # list of colors, one per year
# Angles for the radar chart (12 months)
angles_celsius = [n / 12 * 2 * math.pi for n in range(12)]
angles_celsius += angles_celsius[:1]  # close the circle
# Month names
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Create the radar chart
fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(polar=True))
for i, year in enumerate(radar_data_celsius.index):
    values_celsius = radar_data_celsius.loc[year].tolist()
    values_celsius += values_celsius[:1]  # close the circle
    ax.plot(angles_celsius, values_celsius, linewidth=1, linestyle='solid', label=f'Year {year}', color=colors[i])
# Add month names as labels
plt.xticks(angles_celsius[:-1], month_names, color='grey', size=12)
plt.title('Temperature change over the months of all years in the dataset (in Celsius)', size=20, y=1.1)
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
# Save before plt.show() so the rendered figure (not an empty one) is written
plt.savefig('../output/Temperature change radar chart.png')
plt.show()
The radar chart visualizes the average temperature data derived from the ERA5 weather dataset for Bancroft over the period 2016 to 2022. Each of the 12 axes represents a month, with the monthly average temperatures plotted along them. The average temperature was converted from Kelvin to Celsius for better visual representation.
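The conversion used for the radar chart is a fixed offset, T(°C) = T(K) − 273.15, which is how a column like `avg_temp_celsius` would be derived from the Kelvin column. A tiny sketch with illustrative values (the column names follow the notebook):

```python
import pandas as pd

df_demo = pd.DataFrame({"avg_temp": [273.15, 293.15, 253.15]})  # Kelvin
# Kelvin to Celsius is a pure offset; rounding absorbs floating-point noise.
df_demo["avg_temp_celsius"] = (df_demo["avg_temp"] - 273.15).round(2)
print(df_demo["avg_temp_celsius"].tolist())
```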
Techniques Utilized:
Key Insights from the Radar Chart:
weather_counts = df['wep'].value_counts()
print(weather_counts)
wep
Blue sky day                                           52633
Mild snowfall                                           4497
Moderate snowfall                                       2920
Moderate rain                                           2630
Heavy snowfall with accumulated snow                    1040
Frondurchlauf / Continuous freezing rain                 934
Storm with freezing rain / Heavy snow- and icestorm      430
Snowstorm with high precipitation                        261
Name: count, dtype: int64
# Convert the 'run_datetime' column to a datetime object
df['run_datetime'] = pd.to_datetime(df['run_datetime'])
# Aggregate the average wind speed and temperature for each month across all years
avg_by_month = df.groupby(df['run_datetime'].dt.to_period('M')).agg({
'avg_windspd': 'mean',
'avg_temp': 'mean'
}).reset_index()
# Ensure the presence of a date column
avg_by_month['Month'] = avg_by_month['run_datetime'].dt.strftime('%b')
# Adjust the order of the months
months_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
avg_by_month['Month'] = pd.Categorical(avg_by_month['Month'], categories=months_order, ordered=True)
avg_by_month = avg_by_month.sort_values('Month')
# Plot as a Line Chart
fig, ax1 = plt.subplots(figsize=(15, 8))
# Line for wind speed on the left y-axis
color = 'tab:blue'
ax1.set_xlabel('Month')
ax1.set_ylabel('Average windspeed (m/s)', color=color)
ax1 = sns.lineplot(x='Month', y='avg_windspd', data=avg_by_month, sort=False, label='Wind Speed', color=color)
ax1.tick_params(axis='y', labelcolor=color)
# Second y-axis for temperature
ax2 = ax1.twinx()
color = 'tab:red'
ax2.set_ylabel('Average temperature (°C)', color=color)
ax2 = sns.lineplot(x='Month', y='avg_temp', data=avg_by_month, sort=False, label='Temperature', color=color)
ax2.tick_params(axis='y', labelcolor=color)
plt.title('Average Wind Speed and Temperature per Month Across All Years')
# Adding the legend manually
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()
plt.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper right')
# Save the figure before plt.show(), which clears the current figure
plt.savefig('../output/Average wind speed and temperature per month across all years.png')
plt.show()
This time series analysis provides insightful revelations into the climate of Bancroft, examining the average monthly temperature and wind speed.
Key Findings Average Temperature
Detailed Analysis Temperature and Wind Speed Intersections
Interpretation and Conclusions
This analysis already leads into the examination of relationships between variables, which will be discussed in the following chapter.
# Display a scatter plot in a diagram with subplots for all parameter pairs
fig, axes = plt.subplots(len(relevant_parameters), len(relevant_parameters), figsize=(50, 50), gridspec_kw={'hspace': 0.8, 'wspace': 0.4})
# Function for calculating the correlation coefficient and displaying the scatterplot
def plot_scatter_and_correlation(ax, x, y, data):
sns.scatterplot(x=x, y=y, data=data, ax=ax)
correlation_coefficient = data[x].corr(data[y])
ax.set_title(f'Scatterplot of {x} and {y}', fontsize=16)
ax.set_xlabel(x, fontsize=14)
ax.set_ylabel(y, fontsize=14)
ax.text(0.5, -0.27, f'Correlation Coefficient: {correlation_coefficient:.2f}', transform=ax.transAxes, ha='center', fontsize=12)
# Loop over all parameters
for i, feature_x in enumerate(relevant_parameters):
for j, feature_y in enumerate(relevant_parameters):
if i != j: # Skip the diagonal (x vs. x)
plot_scatter_and_correlation(axes[i, j], feature_x, feature_y, df)
plt.tight_layout()
# Save the figure before plt.show(), which clears the current figure
plt.savefig('../output/Bancroft Association Plots.png')
plt.show()
Exploring Parameter Associations and Correlations This chapter delves into the significance of association and correlation analyses, the methodologies employed, and the nuanced interpretation of results. Additionally, it addresses the rationale behind incorporating temperature and wind parameters in regression analyses, despite their lack of direct correlation.
Significance of Association and Correlation Association and correlation analyses are fundamental tools in uncovering relationships between different variables within a dataset. These analyses provide insights into the strength and direction of connections, helping identify patterns, dependencies, and potential factors influencing the phenomena under study.
To assess the associations and correlations among meteorological parameters, scatterplots were plotted and the correlation coefficients calculated. The scatterplots visually display the relationship between two variables, offering an initial understanding of their potential connection. Subsequently, correlation coefficients (Pearson correlation coefficient) quantify the strength and direction of the linear relationship between variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
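As a concrete illustration of the coefficient described above, the following sketch computes Pearson correlations with pandas on synthetic stand-in data; the column names merely mimic the dataset, and the values are randomly generated:

```python
import numpy as np
import pandas as pd

# Illustrative data: two physically related series plus an unrelated one
rng = np.random.default_rng(42)
temp = rng.normal(10, 5, 200)                  # hypothetical avg_temp values
dewpoint = temp - 2 + rng.normal(0, 1, 200)    # tracks temperature closely
windspd = rng.normal(4, 1.5, 200)              # independent of temperature

demo = pd.DataFrame({'avg_temp': temp, 'avg_dewpoint': dewpoint, 'avg_windspd': windspd})

# Pearson correlation coefficients, as used in the scatterplot grid
print(demo['avg_temp'].corr(demo['avg_dewpoint']))  # close to 1: strong positive linear relation
print(demo['avg_temp'].corr(demo['avg_windspd']))   # close to 0: no linear relation
```

The same `Series.corr` call is what produces the coefficients annotated under each scatterplot above.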
Comparing Parameter Correlations In the exploration, the correlations between all pairs of meteorological parameters were systematically compared. This exhaustive approach aimed to uncover not only direct associations but also potential indirect relationships and dependencies that might influence weather patterns.
A noteworthy observation from the analyses is that physically related variables measured on similar scales tend to exhibit the strongest correlations (e.g., avg_temp, avg_dewpoint, avg_temp_change and min_wet_bulb_temp, as well as avg_windspd and max_windgust). This finding underscores the importance of considering the units and magnitudes of variables when interpreting correlation coefficients: variables that capture closely related physical quantities are more likely to show coherent patterns and relationships. An intriguing aspect of the analysis was the dynamic nature of correlations, particularly in response to extreme weather events. The occurrence of extreme weather events was found to alter the correlations between variables, suggesting that these events act as influential factors in shaping the relationships among meteorological parameters.
Despite not exhibiting a direct correlation, temperature and wind parameters were intentionally included in the following regression analyses. The decision stemmed from the recognition that their relationship might be nonlinear or influenced by additional factors not captured by linear correlation. The exploration of multiple predictor regression analyses sought to unveil nuanced interactions among these parameters, going beyond simple linear associations. The inclusion of temperature and wind parameters in multiple predictor regression analyses aimed to unravel complex dependencies and potential interactions that might contribute to temperature variations. While standalone correlation might not capture these intricate relationships, the multiple predictor regression framework allows for a more holistic understanding of the combined impact of various factors on the target variable.
Conclusion In conclusion, the association and correlation analyses provided valuable insights into the interconnectedness of meteorological parameters. The consideration of variable scales, the dynamic influence of extreme weather events, and the rationale for including seemingly uncorrelated variables in regression analyses contribute to a more nuanced understanding of atmospheric dynamics in Bancroft. This chapter serves as a foundation for the subsequent exploration of multiple predictor regression analyses, where the goal is to decipher the complex interplay among meteorological parameters and their collective impact on weather patterns.
This analysis should also be conducted separately for all different weps.
def plot_wind_speed_and_avg_temp_by_direction(df_full, df, wind_direction_label, avg_temp_label, avg_windspd_label):
# Calculate the average temperature for each wind direction
avg_temp_by_direction = df_full.groupby(wind_direction_label)[avg_temp_label].mean().reset_index()
# Order cardinal directions clockwise
directions_ordered = ['North', 'Northeast', 'East', 'Southeast', 'South', 'Southwest', 'West', 'Northwest']
avg_temp_by_direction[wind_direction_label] = pd.Categorical(avg_temp_by_direction[wind_direction_label], categories=directions_ordered, ordered=True)
avg_temp_by_direction = avg_temp_by_direction.sort_values(wind_direction_label)
# Color scale for temperature
color_scale = plt.get_cmap('coolwarm')  # plt.cm.get_cmap is deprecated in newer matplotlib versions
# Create the bar chart
fig, ax = plt.subplots(figsize=(8, 6))
# Loop through the data and create bars with corresponding color
for i, row in avg_temp_by_direction.iterrows():
color = color_scale(row[avg_temp_label] / avg_temp_by_direction[avg_temp_label].max())
wind_direction = row[wind_direction_label]
average_wind_speed = df[df[wind_direction_label] == wind_direction][avg_windspd_label].mean()
ax.bar(wind_direction, average_wind_speed, color=color, edgecolor='black')
# Chart settings
ax.set_xlabel('Wind Direction')
ax.set_ylabel('Wind Speed (m/s)')
ax.set_title('Wind Speeds and Average Temperatures by Wind Direction')
# Create a separate Axes for the colorbar
cax = fig.add_axes([0.95, 0.1, 0.04, 0.79])
# Add color legend
temperature_min = avg_temp_by_direction[avg_temp_label].min()
temperature_max = avg_temp_by_direction[avg_temp_label].max()
sm = plt.cm.ScalarMappable(cmap=color_scale, norm=plt.Normalize(vmin=temperature_min, vmax=temperature_max))
sm.set_array([]) # empty data for the color legend
cbar = plt.colorbar(sm, cax=cax, orientation='vertical')
cbar.set_label('Average Temperature (°C)')
# Save the figure before plt.show(), which clears the current figure
plt.savefig('../output/Bancroft Wind Speeds and Average Temperatures by Wind Direction.png', bbox_inches='tight')
plt.show()
# Example call of the function
plot_wind_speed_and_avg_temp_by_direction(df_full, df, 'wind_direction_label', 'avg_temp_celsius', 'avg_windspd')
Wind Speed The bars' heights represent the average wind speed for each wind direction. The highest wind speeds are observed from northern and western directions; the lowest occur with southeastern and eastern winds.
Average Temperature The color of the bars represents the average temperature, with warmer temperatures in darker shades of red and cooler temperatures in darker shades of blue. Winds from the southwest bring the highest temperatures. Northern winds are associated with the lowest temperatures.
Interpretation Wind speed varies with wind direction. Northern and western winds bring higher wind speeds, which may indicate stronger wind events or a general tendency for higher wind speeds from these directions. The color coding of the bars shows that southwestern winds tend to correlate with warmer temperatures, while northern winds bring cooler temperatures.
This suggests the origin of the air masses: warmer air masses from the southwest and colder ones from the north. The higher wind speeds from northern and western directions might indicate more frequent and stronger wind events in these directions. Warmer temperatures with southwestern winds could suggest warmer air masses being transported from these directions. The cooler temperatures with northern winds might reflect colder air masses originating from polar regions.
Conclusions The data show that wind direction has a significant impact on wind speed and temperature. These findings can be important for weather forecasting and sectors like energy production, where wind energy and temperature management are crucial factors. Further analyses might focus on how these patterns vary over time and what impact they may have on local climate conditions. Concluding from this analysis, wind and temperature appear to be correlated or even causally connected, as southern winds are milder and warmer, while northern winds are stronger and colder. This hypothesis is evaluated in the further course of this work.
Building upon the comprehensive analysis of weather events and their directional influences, a transition is now made to a more abstract yet insightful realm of data visualization: a 3D space facilitated by Principal Component Analysis (PCA). This advanced method is often used to reduce the dimensionality of large datasets by transforming a large set of variables into a smaller one that still contains most of the information. The multi-dimensional weather data will be condensed into three principal components, providing a visual exploration of the intrinsic structure and variability of the data. By plotting this 3D PCA scatter plot, hidden patterns, clusters, or anomalies across the weather events are anticipated to be uncovered. This approach aims to offer a more profound understanding of the complex interplay between various meteorological elements. Let's venture into this 3D representation to discern the subtle yet critical aspects of weather dynamics that might elude simpler analyses.
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
feature_columns = df.columns
columns_to_exclude = ['wep', 'run_datetime', 'avg_temp_celsius', 'avg_winddir', 'wind_direction_label', 'label0', 'label1', 'label2', 'year', 'month', 'display_label']
feature_columns = [col for col in feature_columns if col not in columns_to_exclude]
X = df[feature_columns].dropna()
X_standardized = StandardScaler().fit_transform(X)
# Apply PCA to reduce dimensionality to 3 components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_standardized)
# Create a DataFrame with PCA results and 'wep' label
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2', 'PC3'])
pca_df['wep'] = df.loc[X.index, 'wep'].values  # .values avoids index misalignment with pca_df's fresh RangeIndex
# Create an interactive 3D scatter plot using Plotly Express with smaller dots
fig = px.scatter_3d(pca_df, x='PC1', y='PC2', z='PC3', color='wep', symbol='wep', size_max=2, opacity=0.7,
labels={'wep': 'Weather'}, title='Interactive 3D PCA Plot of Weather Data')
# Update layout for better visualization
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
# Update marker size for better visibility
fig.update_traces(marker=dict(size=2),
selector=dict(mode='markers'))
# Increase marker size in the legend
fig.update_layout(legend=dict(
itemsizing='constant'
))
# Show the plot
fig.show()
The PCA analysis of the weather dataset has provided a detailed 3D model, revealing a non-uniform distribution of data points with a pronounced cone-like shape spreading primarily along the PC1 and PC2 axes, indicating that these components capture the majority of variability within the weather conditions. The range of PC1, extending from -5 to 15, and PC2, from -5 to 17.5, underscores the significant spread of certain weather events. Notably, 'Storm with freezing rain / Heavy snow- and icestorm' demonstrates an exceptional reach beyond 10 on PC1 and over 15 on PC2, marking it as an outlier, while 'Blue sky day' is distinguished by its extension below 0 on PC1 and -5 to 10 on PC2. These conditions, along with 'Mild snowfall', which also appears below 0 on PC2, highlight the wide variation in meteorological features and their considerable impact on the PCA model.
The dense clustering around the origin suggests that the most common weather patterns share a high degree of overlap in key features such as average temperature and precipitation. This dense core, contrasted with the extended tail and outliers in the PCA plots, suggests a gradient from frequent to rare and extreme weather events, with the latter possibly driven by extreme values in features like wind speeds or temperature changes.
However, interpreting this 3D PCA model is not without challenges. Separating clusters, especially within PC3, which has a smaller range from -8 to 8, is difficult due to the subtlety that this component captures, potentially related to less dominant weather features. The considerable overlap of conditions within PC1 further complicates this task, demonstrating the complexity of distinguishing between different weather patterns.
The distinct spread and visibility of events such as 'Storm with freezing rain / Heavy snow- and icestorm' and 'Blue sky day' at first glance provide valuable insights into the extremities of the weather dataset. This visibility serves as an essential starting point for further investigation, possibly guiding targeted research into the factors contributing to these extreme phenomena.
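One way to substantiate the observation that PC1 and PC2 capture most of the variability is to inspect the fitted PCA's `explained_variance_ratio_`. The sketch below demonstrates this on synthetic data with four correlated columns and two noise columns; it does not reproduce the actual weather dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized weather features (values are illustrative)
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(500, 1)) for _ in range(4)]
              + [rng.normal(size=(500, 2))])        # 4 correlated + 2 noise columns
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
pca.fit(X_std)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)                # PC1 dominates because of the correlated block
print(pca.explained_variance_ratio_.sum())          # variance retained by the 3-component model
```

Applied to the real `X_standardized` above, the same attribute would quantify how much structure the 3D plot preserves.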
First, an overview of various regression and classification algorithms is given. Then three different algorithms will be evaluated for each research question in order to investigate their quality in relation to the research question.
In general, different types of algorithms have varying use cases, thus necessitating consideration of the application scenarios.
In weather data analysis, the choice between classification and regression depends on the nature of the target variable. Classification is better suited for predicting categorical outcomes, such as weather events. The distinct boundaries between these events and the intuitive interpretation of classification results make it particularly suitable for this purpose. Classification is also adept at handling multiple classes and imbalanced data, common characteristics in weather event datasets. The actionable insights provided by classification outcomes are valuable in decision-making scenarios.
On the other hand, regression is more appropriate for predicting continuous variables, such as temperature. Temperature changes often occur gradually and exhibit a numerical gradient, making regression well-equipped to capture these subtle variations. Regression models are effective in scenarios where precise numerical predictions are essential, providing a quantitative representation of temperature values. Additionally, regression models, particularly those designed for time-series analysis, can account for temporal patterns commonly found in temperature data.
In summary, while classification excels in providing clear categorizations of weather events, regression is better suited for the nuanced prediction of continuous variables like temperature, offering precise numerical estimations and accommodating subtle variations over time or space.
Linear Regression: Linear regression models describe a linear relationship between a dependent variable and one or more independent variables. The least squares method is used to fit a straight line that represents the response variable [^24].
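To make the least squares idea tangible, here is a minimal sketch that recovers a known slope and intercept from noisy synthetic data; the true values 2.0 and 3.0 are chosen purely for illustration:

```python
import numpy as np

# Illustrative data: y depends linearly on x plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 3.0 + rng.normal(0, 0.5, 100)

# Least-squares fit of a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # close to the true values 2.0 and 3.0
```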
Logistic Regression: Logistic Regression models are used for classification. They offer simplicity and interpretability, which make them ideal for straightforward binary classification tasks. They perform well in scenarios with smaller datasets and maintain efficiency while being less prone to overfitting. Their effectiveness diminishes when faced with complex, non-linear relationships within the data [^25].
Decision Trees and Random Forests: Decision Trees (or Regression Trees) and Random Forests, as an enhancement of them, can both be used for regression and classification. They are useful for use cases with non-linear relationships and sets of easily definable rules (if a and b happen, but c is not given, d will happen). They are robust to outliers but prone to overfitting [^26], which makes plain decision trees less well-suited for very complex use cases [^27]. Random Forests, an ensemble of multiple decision trees, are better at overcoming overfitting, and their parameters can be set easily. In addition, variable importance and accuracy are generated automatically. Random Forests are therefore better suited for modelling high-dimensional data [^28].
Gradient Boosting (e.g., XGBoost, LightGBM): Gradient Boosting uses an ensemble of machine learning models and can reach a high accuracy. An initial model is fitted, and on top of that another model is fitted to the errors (residuals) of the first. This process is called boosting and leads to a highly accurate model that is relatively robust to outliers. The disadvantages are that these models are computationally expensive and need hyperparameter tuning [^29].
Support Vector Regression (SVR) / Support Vector Machines (SVM):
Support Vector Regression (SVR) and Support Vector Machines (SVM) excel when there is a clear separation between classes. They are effective in high-dimensional spaces and perform well when the number of dimensions exceeds the number of samples, while also being relatively memory-efficient. Handling large datasets and coping with noisy data or overlapping target classes are weaknesses of SVMs. Additionally, if the number of features far surpasses the number of training samples, an SVM may underperform [^30].
K-Nearest Neighbors (KNN):
The classification algorithm K-Nearest Neighbors (KNN) is especially known for its easy implementation and for not needing a training period, which makes it a suitable choice for straightforward tasks. KNN is sensitive to noisy data, and its performance is impaired by large datasets with high dimensionality [^31].
Naive Bayes:
Naive Bayes is referred to as one of the most efficient and effective classification algorithms [^32]. It is praised for its simplicity and computational efficiency, making it a quick choice for certain applications. It handles high-dimensional data effectively but assumes feature independence, which impacts its performance with correlated features.
Neural Networks (Deep Learning):
Neural Networks are adept at capturing intricate patterns in data, making them well-suited for complex tasks. They can handle large and intricate datasets but come with the trade-offs of requiring substantial amounts of data and being computationally expensive. The black-box nature of neural networks can also limit their interpretability. However, the application of Neural Networks is out of scope for this analysis [^33].
Predicting weather-related data, such as the ERA5 dataset, poses various challenges due to the complex and dynamic nature of atmospheric phenomena.
One key challenge lies in the inherent temporal dependencies within the data. Time series data, such as weather observations over a period of time, often exhibit temporal dependencies, where observations at one point are correlated with those at nearby points. A strategy for addressing this challenge is to incorporate lagged variables as features. By utilizing past values of relevant variables at specific time intervals, the model can effectively capture temporal patterns and dependencies, enhancing its predictive capabilities.
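A minimal sketch of this lagged-feature idea, assuming a DataFrame with an `avg_temp` column as in the dataset (the toy values here are illustrative):

```python
import pandas as pd

# Illustrative short series; in practice this would be the hourly ERA5 data
ts = pd.DataFrame({'avg_temp': [1.0, 2.0, 4.0, 3.0, 5.0]})

# Lagged copies of the target let a model see the recent past at each row
for lag in (1, 2):
    ts[f'avg_temp_lag{lag}'] = ts['avg_temp'].shift(lag)

# Rows without a full history are dropped before training
ts = ts.dropna()
print(ts)
```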
Another relevant aspect is that the dataset has seasonality and trends that affect the target variable. Weather patterns often demonstrate cyclic variations and underlying trends that can significantly impact the target variable. To mitigate this, detrending and deseasonalizing techniques become imperative. Through methods such as differencing or seasonal decomposition, the data can be processed to emphasize the underlying patterns while minimizing the influence of seasonal fluctuations, allowing the model to focus on the essential predictive features.
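Differencing, one of the techniques mentioned above, can be sketched as follows on a synthetic series with a known linear trend and a 12-step seasonal cycle:

```python
import numpy as np
import pandas as pd

# Illustrative series: linear trend plus a sinusoidal 12-step seasonal pattern
t = np.arange(48)
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12))

# First differencing removes the linear trend ...
detrended = series.diff()
# ... and seasonal differencing at the cycle length removes the recurring pattern
deseasonalized = series.diff(12)

print(detrended.dropna().var(), series.var())  # variance drops sharply after differencing
print(deseasonalized.dropna().std())           # essentially zero: the cycle is removed
```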
Furthermore, the imbalanced distribution of weather classes within the dataset poses a practical challenge for classification tasks. Certain weather conditions naturally occur more infrequently than others, leading to imbalanced classes. Addressing this challenge involves employing techniques like oversampling, undersampling, or synthetic data generation to rebalance the classes. Tuning classification algorithms to account for class imbalances is crucial for ensuring accurate predictions across all weather categories. By systematically addressing these challenges, a robust framework for predicting weather-related variables in the ERA5 dataset can be established, contributing to more accurate and reliable forecasting models.
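Random oversampling with `sklearn.utils.resample`, one of the rebalancing options mentioned, might look like this; the labels mirror the `wep` distribution only in spirit, and the data are synthetic:

```python
import pandas as pd
from sklearn.utils import resample

# Illustrative imbalanced labels: 95 common vs. 5 rare events
data = pd.DataFrame({'wep': ['Blue sky day'] * 95 + ['Snowstorm with high precipitation'] * 5,
                     'avg_windspd': range(100)})

majority = data[data['wep'] == 'Blue sky day']
minority = data[data['wep'] != 'Blue sky day']

# Random oversampling: draw the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced['wep'].value_counts())  # both classes now equally represented
```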
#import dependencies for models (import2)
from sklearn import svm
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LinearRegression, LogisticRegression, SGDRegressor
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import statsmodels.api as sm
import pickle
Is it possible to find a correlation or causation between the temperature and wind speed, wind gust, or wind direction using regression techniques? This question will be examined through regression analysis, investigating the impact of the independent variables (wind features) on the temperature. The result of this regression analysis is the identification of the correlation or causation between wind features and the temperature based on the historical weather data.
The first step is to instantiate the selected models. A variety of regression techniques will be used including Linear Regression, Gradient Boosting Regressor, Stochastic Gradient Descent Regressor, and Support Vector Regressor. Each model uses a unique approach to learn the relationship between temperature and wind characteristics.
# Create Linear Regression model
lr_model_1 = LinearRegression()
# Create Gradient Boosting Regressor
gb_model_1 = GradientBoostingRegressor(n_estimators=100, random_state=42)
# Create Stochastic Gradient Descent Regressor
sgd_model_1 = SGDRegressor(max_iter=1000, random_state=42)
# Create Support Vector Regressor
svr_model_1 = SVR()
After the training models are instantiated, the next step is to prepare the dataset for training and validation. This involves selecting the features (independent variables) and the dependent variable (temperature), and then splitting the dataset into training and test sets. This process ensures that the models can learn from a portion of the data (training set) and then be validated against an unseen portion (test set) to assess their predictive performance and generalization capability.
# Select the features (independent variable)
X = df['avg_windspd'].values.reshape(-1, 1)
# Select the dependent variable
y = df['avg_temp']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shapes of the sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
X_train shape: (52276, 1)
X_test shape: (13069, 1)
y_train shape: (52276,)
y_test shape: (13069,)
After selecting the feature (average wind speed) as the independent variable and temperature as the dependent variable, the dataset is divided into training and testing sets using an 80-20 split. This means 80% of the data will be used for training the model, and the remaining 20% will be used for testing. The shapes of the resulting sets are as follows:
X_train (features for training) contains 52,276 samples, each with 1 feature (average wind speed). X_test (features for testing) contains 13,069 samples, again each with 1 feature. y_train (temperatures for training) contains 52,276 samples. y_test (temperatures for testing) contains 13,069 samples. The next step is the fitting, which is the actual training of the models on the prepared data. Each model is fitted separately using the training set.
# fit Linear Regression model
lr_model_fit_1 = lr_model_1.fit(X_train, y_train)
# fit Gradient Boosting Regressor
gb_model_fit_1 = gb_model_1.fit(X_train, y_train)
# fit Stochastic Gradient Descent Regressor
sgd_model_fit_1 = sgd_model_1.fit(X_train, y_train)
# fit Support Vector Regressor
svr_model_fit_1 = svr_model_1.fit(X_train, y_train)
Each model is now trained and ready to make predictions. The subsequent step is typically the evaluation, which involves testing the models on unseen data to assess their performance. This step is crucial in determining how well each model generalizes to new data, which is vital for real-world applications. It's also important to compare the models against each other to understand their respective strengths and weaknesses, ensuring the selection of the most appropriate model for predicting temperature based on wind characteristics. This comparative analysis will guide the final decision on which model to deploy for effective and reliable predictions.
Next, a method for evaluation is defined, and the models are then assessed based on this approach.
# Function to evaluate and print model performance
def evaluate_model(model_name, model, X_test, y_test):
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f'{model_name} Model:')
print(f'Mean Squared Error: {mse:.2f}')
print(f'Mean Absolute Error: {mae:.2f}')
print()
# Evaluate Linear Regression model
evaluate_model('Linear Regression', lr_model_fit_1, X_test, y_test)
# Evaluate Gradient Boosting model
evaluate_model('Gradient Boosting', gb_model_fit_1, X_test, y_test)
# Evaluate Stochastic Gradient Descent model
evaluate_model('Stochastic Gradient Descent', sgd_model_fit_1, X_test, y_test)
# Evaluate Support Vector Regression model
evaluate_model('Support Vector Regression', svr_model_fit_1, X_test, y_test)
Linear Regression Model:
Mean Squared Error: 127.78
Mean Absolute Error: 9.63

Gradient Boosting Model:
Mean Squared Error: 127.72
Mean Absolute Error: 9.63

Stochastic Gradient Descent Model:
Mean Squared Error: 127.78
Mean Absolute Error: 9.63

Support Vector Regression Model:
Mean Squared Error: 129.85
Mean Absolute Error: 9.57
The models have been evaluated on the test set using two metrics: Mean Squared Error (MSE) and Mean Absolute Error (MAE). These metrics provide a quantitative measure of each model's accuracy in predicting temperature based on wind speed. Lower values indicate better performance. Here are the results:
The results indicate that while all models perform similarly, there are slight variations in their accuracy. This nuanced understanding of each model's performance will guide the final decision on the most appropriate model for deployment.
#### 5.3.1.5 Regression Analysis Temperature and Wind Modeling: Visualize results
Transitioning from model evaluation to visualization is crucial, as it provides an intuitive understanding of model performance. Visualization facilitates the immediate identification of patterns, outliers, and the overall fit of the models through graphical representation of actual versus predicted values and the distribution of residuals.
from scipy.stats import zscore
# Function to create subplots for model visualization and outlier analysis
def visualize_model(model_name, model, X_test, y_test, subplot_idx):
y_pred = model.predict(X_test)
# Calculate residuals
residuals = y_test - y_pred
# Calculate MSE and MAE
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# Detect outliers using Z-scores
z_scores = zscore(residuals)
outliers = np.abs(z_scores) > 2 # Threshold for considering a point as an outlier
# Create subplots for actual vs predicted values
plt.subplot(2, 4, subplot_idx)
plt.scatter(X_test[outliers], y_test[outliers], color='orange', label='Outliers')
plt.scatter(X_test[~outliers], y_test[~outliers], color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted', alpha=0.7)
# Sort by x so the prediction curve is drawn left to right instead of zigzagging
order = np.argsort(X_test.ravel())
plt.plot(X_test.ravel()[order], y_pred[order], color='green', linewidth=2)
plt.title(f'{model_name} Model\nMSE: {mse:.2f}, MAE: {mae:.2f}')
plt.xlabel('Average Wind Speed')
plt.ylabel('Average Temperature')
# Create subplots for residual plots
plt.subplot(2, 4, subplot_idx + 4)
plt.scatter(X_test[outliers], residuals[outliers], color='orange', label='Outliers')
plt.scatter(X_test[~outliers], residuals[~outliers], color='purple')
plt.axhline(y=0, color='black', linestyle='--', linewidth=2)
plt.title(f'Residuals\nMean Residual: {np.mean(residuals):.2f}')
plt.xlabel('Average Wind Speed')
plt.ylabel('Residuals')
# Create subplots
plt.figure(figsize=(16, 8))
# Visualize Linear Regression model with outliers
visualize_model('Linear Regression', lr_model_fit_1, X_test, y_test, 1)
# Visualize Gradient Boosting model with outliers
visualize_model('Gradient Boosting', gb_model_fit_1, X_test, y_test, 2)
# Visualize Stochastic Gradient Descent model with outliers
visualize_model('Stochastic Gradient Descent', sgd_model_fit_1, X_test, y_test, 3)
# Visualize Support Vector Regression model with outliers
visualize_model('Support Vector Regression', svr_model_fit_1, X_test, y_test, 4)
# Adjust layout for better visualization
plt.tight_layout()
# Create a single legend at the top of the entire plot
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, shadow=True, ncol=4)
plt.show()
plt.savefig('../output/Linear Regression Analysis Temperature and Wind Modeling Results.png')
The plots present a comparative analysis of the models, applied to predict average temperature from average wind speed. The upper plots display the actual versus predicted temperatures, and the lower plots show the residuals (differences between actual and predicted values) for each model.
For all models, the distribution of data points indicates a weak relationship between wind speed and temperature, as shown by the dispersed and broadly scattered plots. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) are relatively high across all models, suggesting that the models are not accurately predicting temperature based on wind speed alone. The presence of numerous outliers, particularly evident in the residual plots where they are marked in orange, further signifies that the models' predictions deviate significantly from the actual values for many observations.
The residual plots are centered around the zero line, which is expected, but the wide spread of residuals indicates that the prediction errors are substantial. The mean residuals are close to zero for the Linear Regression, Gradient Boosting, and Stochastic Gradient Descent models, showing no bias in over or underestimation. However, the Support Vector Regression model has a negative mean residual, indicating a slight tendency to underpredict temperatures.
The results align with the preliminary analysis, suggesting that while wind speed does have some relationship with temperature, it is not a strong predictor on its own. The complex nature of temperature patterns, influenced by various climatic factors, cannot be captured through a simple regression with wind speed as the sole predictor. This justifies the need for a multiple regression analysis incorporating additional variables that affect temperature.
The reference to a subsequent picture showing the results for a linear regression analysis between wind speed and wind gust implies that in cases where two variables are intrinsically linked and exhibit a clear, linear relationship, simple regression can yield low MSE and MAE values, indicative of a successful model fit. However, such an analysis may be of limited scientific interest when the variables are inherently correlated and do not provide unique or insightful information about the system being studied.
The visualized results prove that the temperature variable can hardly be determined from the wind speed variable alone. The MSE and MAE values are poor, and the results contain many outliers. This was to be expected after the EDA in the previous chapters, as temperature and wind speed are driven by quite different climate phenomena. The time series analysis likewise showed very different behaviour for the temperature and wind variables: while the temperature exhibited a pronounced seasonal pattern and a bimodal distribution, neither was true for the wind parameters. The association plots confirmed the same. To predict the temperature more accurately, a multiple regression analysis using several predictor variables should be employed.
The following picture 2 shows the results of a linear regression analysis for wind speed and wind gust, which was created experimentally in a previous iteration (by simply switching the relevant parameters in the Python code). Without discussing the results of this analysis or the visualization further, the picture is meant to show what a good and successful linear regression analysis should look like: a linear distribution of the data points and small MSE and MAE. From a scientific point of view, however, these results are of little interest, since the wind speed and wind gust parameters are by nature very similar, correlate strongly, and even have a causal connection, as they derive from the same climate phenomena. Predicting one from the other seems redundant, as they are also measured by the same sensors.
The following analysis focuses on multiple regression in order to improve the prediction of the temperature feature.
Picture 2: Linear Regression Analysis Windspeed and Windgust Results (Created by previous experimentive analysis iteration)
Saving the trained models with timestamped filenames allows for organized storage and easy retrieval for subsequent analysis or deployment.
# Create a meaningful name with a timestamp
model_name_lr_1 = f"lr_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_gb_1 = f"gb_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_sgd_1 = f"sgd_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_svr_1 = f"svr_model_fit_1_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name_lr_1, 'wb') as f:
    pickle.dump(lr_model_fit_1, f)
with open(folder_path + model_name_gb_1, 'wb') as f:
    pickle.dump(gb_model_fit_1, f)
with open(folder_path + model_name_sgd_1, 'wb') as f:
    pickle.dump(sgd_model_fit_1, f)
with open(folder_path + model_name_svr_1, 'wb') as f:
    pickle.dump(svr_model_fit_1, f)
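For the retrieval side mentioned above, loading the newest saved model could look as follows. This is a sketch: the helper name is hypothetical, and it assumes the timestamped `.pkl` files created above. Because the timestamp format `YYYYMMDD_HHMM` sorts lexicographically, the last match is the most recent.

```python
import glob
import os
import pickle

def load_latest_model(folder_path, prefix):
    """Load the most recently saved pickled model matching prefix_*.pkl.

    The timestamp suffix (YYYYMMDD_HHMM) sorts lexicographically,
    so the last entry of the sorted glob is the newest file.
    """
    candidates = sorted(glob.glob(os.path.join(folder_path, f"{prefix}_*.pkl")))
    if not candidates:
        return None
    with open(candidates[-1], "rb") as f:
        return pickle.load(f)

# e.g. lr_model_restored = load_latest_model("../models/", "lr_model_fit_1")
```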
In this chapter, linear regression with multiple predictor variables is used to predict the temperature variable with improved accuracy. The analysis applies techniques for multivariate temperature prediction, i.e. linear regression of temperature on multiple predictors. It is meant to answer the research question: how does the incorporation of multiple atmospheric predictors, such as wind speed, wind gust, wind direction, air pressure, and snow and ice parameters, enhance the accuracy of temperature prediction compared to a model based solely on wind speed? This question delves into the complex interplay of various atmospheric variables and their combined effect on temperature changes over different time scales (time of day, day, season, etc.). The outcome of this analysis will contribute to a nuanced model that considers a spectrum of atmospheric conditions for more precise temperature forecasting.

This task also includes using the regression model to uncover interactions and synergies among different atmospheric predictors in influencing temperature variations, seeking to identify potential non-linear relationships or combined effects of multiple predictors on temperature. Understanding these interactions is crucial for refining the predictive model and gaining insights into the intricate dynamics of the atmosphere. For this task, forward selection and backward elimination methods will be used.

Furthermore, temporal dynamics are to be analyzed: how does the temporal aspect, including diurnal patterns and seasonal variations, impact the relationship between temperature and multiple predictors? This question aims to capture the temporal dynamics of the atmosphere and assess how the predictive power of the model varies across different timeframes. Examining temporal patterns will contribute to a more nuanced understanding of how atmospheric predictors influence temperature fluctuations under varying conditions.
5.3.2.1.1 Check for Multicollinearity - Correlation Matrix
If two independent variables exhibit a high degree of correlation, it gives rise to a phenomenon known as multicollinearity, posing challenges for analysis and interpretation.
To assess potential multicollinearity, the initial step involves examining correlation coefficients for each pair of continuous (scale) variables. These will be displayed in a correlation matrix. A correlation matrix is a statistical tool that provides a comprehensive view of the relationships between multiple variables in a dataset. It is a square matrix where each entry represents the correlation coefficient between two variables, indicating the strength and direction of their linear relationship.
The correlation coefficient ranges from -1 to 1. A positive correlation suggests that as one variable increases, the other tends to increase as well, while a negative correlation indicates an inverse relationship. Correlations of 0.8 or higher (in absolute value) indicate a strong relationship, implying that only one of the two variables is necessary for the regression analysis.
This information is valuable for understanding the interplay among different factors, aiding in the interpretation of data, and informing decision-making processes. Additionally, the correlation matrix is a crucial tool in feature selection, helping to identify and retain only the most relevant variables in predictive modeling and analysis.
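Beyond pairwise correlations, the variance inflation factor (VIF) offers a complementary multicollinearity check: it measures how well each predictor can be reconstructed from all the others. This is not part of the original analysis; the sketch below uses illustrative random data rather than the ERA5 frame, and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Variance inflation factor per predictor: VIF_j = 1 / (1 - R_j^2),
    where R_j^2 comes from regressing predictor j on all other predictors.
    Values above roughly 5-10 flag multicollinearity."""
    rows = []
    for col in X.columns:
        others = X.drop(columns=[col])
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        rows.append((col, 1.0 / (1.0 - r2)))
    return pd.DataFrame(rows, columns=["variable", "VIF"])

# Illustrative data (not the ERA5 frame): x2 is almost a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})
print(vif_table(demo))  # x1 and x2 get large VIFs, x3 stays near 1
```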
#drop columns with datatype object
df_feature_selection = df.drop(columns=['run_datetime','wep','wind_direction_label','label2','display_label'])
# Calculate the correlation matrix
correlation_matrix = df_feature_selection.corr()
# Create a heatmap for the correlation matrix
plt.figure(figsize=(20, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Variables')
# Save before plt.show(), which clears the current figure
plt.savefig('../output/Linear Regression Analysis Multiple Predictors Correlation Matrix of Variables.png')
plt.show()
The heatmap represents a correlation matrix for various weather-related variables. In this matrix, the color intensity reflects the strength of the correlation between any two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no correlation.
Highly Correlated Temperature Variables: There's a block of variables, including 'avg_temp', 'avg_temp_celsius', 'min_wet_bulb_temp', and 'avg_dewpoint', which are all very highly correlated to each other (correlation coefficients above 0.97). This suggests that they are essentially providing the same information about the temperature.
Wind Variables: 'avg_windspd' and 'max_windgust' are moderately correlated (0.85), which is expected as gusts are generally related to average wind speeds. However, their correlations with temperature variables are low, indicating that they may not be strong predictors of temperature.
Directional Wind Variables: 'avg_winddir' and its sine/cosine transformations ('avg_winddir_sin', 'avg_winddir_cos') have very low correlations with other variables, implying they have little or no linear relationship with temperature and other meteorological factors.
Precipitation Variables: 'max_cumulative_precip', 'max_snow_density_6', and 'max_cumulative_snow' show moderate to high correlations with each other. 'max_snow_density_6' and 'max_cumulative_snow' are negatively correlated with the temperature variables, which fits the expectation that snow is associated with lower temperatures.
Pressure Change: 'avg_pressure_change' shows a moderate negative correlation with 'avg_temp' and is highly negatively correlated with 'label1'.
Labels: 'label0' and 'label1' are designed to be highly negatively correlated (-0.77). Their correlation with temperature is medium, suggesting that they may be significant for temperature prediction. The negative correlation of 'label0' with temperature suggests it is inversely related to temperature, whereas 'label1' has a positive correlation.
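The sine/cosine transformations of wind direction mentioned above exist because direction is a circular quantity: 350° and 10° are numerically far apart but physically nearly identical. A small sketch of the encoding (the helper name is hypothetical):

```python
import numpy as np

def encode_winddir(deg: float) -> np.ndarray:
    """Encode a wind direction in degrees as a (sin, cos) pair so that
    the wrap-around at 0/360 degrees does not distort distances."""
    rad = np.deg2rad(deg)
    return np.array([np.sin(rad), np.cos(rad)])

# 350 deg and 10 deg differ by 340 in raw degrees, but their encoded
# points lie only 20 degrees apart on the unit circle
gap = np.linalg.norm(encode_winddir(350.0) - encode_winddir(10.0))
print(gap)  # chord length 2*sin(10 deg), roughly 0.35
```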
Given the set threshold for correlation (0.8), the process suggests dropping 'avg_temp_celsius', 'min_wet_bulb_temp', and 'avg_dewpoint' due to their redundancy with 'avg_temp', which is the primary temperature measure. The exclusion of 'max_windgust' alongside 'avg_windspd' may be due to their collinearity, while 'avg_temp_change' is removed for being derivative of the temperature itself.
The removal of 'avg_pressure_change' is not depicted in this correlation matrix but is mentioned as a data consistency measure, aligning with the need to have complete data across the temporal scope of the study.
Lastly, 'label0' is considered for exclusion to avoid potential bias and overfitting, given its strong correlation with the target variable. This approach is consistent with practices aimed at improving model generalizability and accuracy.
In the feature selection process, several variables were dropped from the dataset due to various considerations.
Setting the correlation threshold to 0.8, the variables 'avg_temp_celsius', 'min_wet_bulb_temp', 'avg_dewpoint', and 'max_windgust' should be excluded because they exhibit high collinearity, indicating redundant information in the model. This step is essential to enhance the model's interpretability and avoid multicollinearity issues.
Additionally, 'avg_temp_change' is removed as it is obviously highly correlated to the temperature.
Additionally, the variable 'avg_pressure_change' was removed as it lacked data for the year 2022, making it inconsistent with the temporal scope of the analysis. Maintaining data consistency is crucial for the reliability of the model.
Furthermore, 'label0' was dropped from the dataset because it demonstrated a strong correlation with the target variable. Including such highly correlated variables may introduce bias and hinder the model's ability to generalize to new data. Therefore, the decision to exclude 'label0' aims to improve the model's predictive accuracy and reduce overfitting.
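The manual column drop performed in the next cell could also be automated against the stated 0.8 threshold. A sketch (the helper name is hypothetical) that flags every column whose absolute correlation with an earlier column exceeds the threshold, using the upper triangle of the correlation matrix to avoid flagging both members of a pair:

```python
import numpy as np
import pandas as pd

def correlated_columns(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return columns whose absolute correlation with an earlier column
    exceeds the threshold (candidates for dropping)."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is counted once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Illustrative frame: b duplicates a, c is roughly independent
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                     "b": [2, 4, 6, 8, 10],
                     "c": [5, 3, 8, 1, 7]})
print(correlated_columns(demo))  # ['b']
```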
#set "avg_temp_change" to 0 in the first row, as there is no temp change
df_feature_selection.at[0, 'avg_temp_change'] = 0
#drop columns, which are not numeric and not needed
df_feature_selection = df_feature_selection.drop(columns = ['min_wet_bulb_temp', 'avg_temp_change', 'max_windgust' , 'avg_dewpoint' , 'avg_temp_celsius','avg_pressure_change','label0'] )
#show adjusted df
df_feature_selection.head()
| | avg_temp | avg_windspd | avg_winddir | avg_winddir_sin | avg_winddir_cos | max_cumulative_precip | max_snow_density_6 | max_cumulative_snow | max_cumulative_ice | label1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 287.389224 | 3.386380 | 80.302464 | -0.981653 | 0.190676 | 2.009 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 287.378997 | 3.326687 | 76.866373 | 0.994736 | 0.102466 | 1.209 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 287.388845 | 3.243494 | 76.258867 | 0.758262 | 0.651950 | 0.400 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | 287.427324 | 3.145505 | 78.299616 | 0.237897 | -0.971290 | 0.000 | 0.0 | 0.0 | 0.0 | 1 |
| 4 | 287.489158 | 3.047607 | 84.632852 | 0.189006 | -0.981976 | 0.000 | 0.0 | 0.0 | 0.0 | 1 |
5.3.2.1.2 Feature Forward Selection:
Moving into the feature selection phase, the process shifts focus to refining the predictor variables to enhance the model's predictive power. This involves identifying the most significant features through forward selection, a methodical approach to building the regression model by adding variables sequentially based on their statistical significance and contribution to the model's overall explanatory power.
# Extracting the target variable and predictor variables
y = df_feature_selection['label1']
# Drop the target variable and strongly correlated variables
X = df_feature_selection.drop(columns=['label1'])
# Initializing an empty list to store the selected variables
selected_vars = []
# Initializing an empty list to store adjusted R-squared scores for each step
adjusted_r2_scores = []
# Performing forward selection
for i in range(len(X.columns)):
    best_score = -1  # Initialize with a value that will be replaced on the first iteration
    best_var = None
    # Iterate through remaining variables
    for var in X.columns:
        if var not in selected_vars:
            # Include the variable in the model
            current_vars = selected_vars + [var]
            X_current = X[current_vars]
            X_current = sm.add_constant(X_current)  # Adding a constant for the intercept
            # Fit the model
            model = sm.OLS(y, X_current).fit()
            current_score = model.rsquared_adj
            # Update the best variable if the score is higher
            if current_score > best_score:
                best_score = current_score
                best_var = var
    # Update selected variables and store results
    selected_vars.append(best_var)
    adjusted_r2_scores.append(best_score)
    print(f"Step {i + 1}: Added {best_var}, Adjusted R-squared: {best_score:.4f}")
# Display the results
results_df = pd.DataFrame({'Variable': selected_vars, 'Adjusted R-squared': adjusted_r2_scores})
print("\nFinal Forward Selection Results:")
print(results_df)
Step 1: Added max_snow_density_6, Adjusted R-squared: 0.4062
Step 2: Added avg_temp, Adjusted R-squared: 0.5109
Step 3: Added max_cumulative_snow, Adjusted R-squared: 0.5506
Step 4: Added max_cumulative_ice, Adjusted R-squared: 0.5527
Step 5: Added avg_windspd, Adjusted R-squared: 0.5541
Step 6: Added avg_winddir, Adjusted R-squared: 0.5551
Step 7: Added max_cumulative_precip, Adjusted R-squared: 0.5556
Step 8: Added avg_winddir_sin, Adjusted R-squared: 0.5556
Step 9: Added avg_winddir_cos, Adjusted R-squared: 0.5556
Final Forward Selection Results:
Variable Adjusted R-squared
0 max_snow_density_6 0.406223
1 avg_temp 0.510857
2 max_cumulative_snow 0.550552
3 max_cumulative_ice 0.552674
4 avg_windspd 0.554109
5 avg_winddir 0.555149
6 max_cumulative_precip 0.555629
7 avg_winddir_sin 0.555632
8 avg_winddir_cos 0.555626
5.3.2.1.3 Backward Feature Elimination:
Next, the focus will shift to optimizing the feature set using a backward feature elimination approach. By training a RandomForestClassifier with all the identified relevant features, an initial accuracy benchmark will be established. Then, through an iterative process, features will be selectively removed one by one, reassessing the model's accuracy with each reduction. The goal is to determine the smallest set of features that maintain or improve the initial accuracy, thereby streamlining the model for efficiency while preserving its predictive strength. This method aims to reduce complexity, potentially improve performance, and offer insights into the most influential predictors.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a model with all features
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
initial_accuracy = accuracy_score(y_test, y_pred)
# Perform Backward Feature Elimination
all_features = list(X.columns)
selected_features = all_features.copy()
while len(selected_features) > 1:  # Stopping criterion: keep at least one feature
    # Train model with the current set of features
    model.fit(X_train[selected_features], y_train)
    y_pred = model.predict(X_test[selected_features])
    # Evaluate model performance
    current_accuracy = accuracy_score(y_test, y_pred)
    # Stop once the current feature set matches or beats the initial accuracy
    if current_accuracy >= initial_accuracy:
        break
    else:
        # Otherwise remove the least important feature and try again
        least_important_feature = min(selected_features,
                                      key=lambda f: model.feature_importances_[selected_features.index(f)])
        selected_features.remove(least_important_feature)
print("Optimal Feature-Subset:")
for feature in selected_features:
    print(feature)
print("Accuracy: ",current_accuracy)
Optimal Feature-Subset:
avg_temp
avg_windspd
avg_winddir
avg_winddir_sin
avg_winddir_cos
max_cumulative_precip
max_snow_density_6
max_cumulative_snow
max_cumulative_ice
Accuracy:  0.9996939322059836
The utilization of forward selection and backward elimination techniques in feature selection did not yield significant advantages in this study. Prior to the commencement of the Bachelor's thesis, a thorough evaluation of variables was conducted based on existing literature and relevant metrics. This meticulous pre-selection process ensured that only essential variables were included in the dataset, aligning with established theoretical foundations and methodological considerations.
Given the thoughtful consideration applied during the variable selection phase, where variables were chosen based on their significance and relevance, the additional application of forward selection and backward elimination did not contribute substantially to the refinement of the feature set. The initial careful curation of variables, guided by a comprehensive understanding of the research domain, metrics, and existing knowledge, rendered these automated techniques less impactful in enhancing the model's performance.
Therefore, the decision to forego extensive reliance on forward selection and backward elimination was grounded in the belief that the prior knowledge-driven variable selection approach had already captured the essential components for the study, ensuring a focused and meaningful analysis.
5.3.2.1.4 Write Results to Df
Write the relevant variables and the temperature to a new DataFrame.
# Extract relevant columns from the DataFrame
results_df = df[["run_datetime","avg_temp", "avg_winddir", "avg_windspd","avg_winddir_sin","avg_winddir_cos","max_cumulative_precip","max_snow_density_6","max_cumulative_snow","max_cumulative_ice"]].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
# Set the 'run_datetime' column as the DataFrame index and convert it to datetime format
results_df['run_datetime'] = pd.to_datetime(results_df['run_datetime'])
results_df.set_index('run_datetime', inplace=True)
# Remove rows with duplicated indices, keeping only the first occurrence
results_df = results_df.loc[~results_df.index.duplicated(keep='first')]
# Set "avg_temp_change" to 0 in the first row, as there is no temp change
results_df.iloc[0, results_df.columns.get_loc("avg_temp")] = 0
# Display the first few rows of the resulting DataFrame
results_df.head()
| | avg_temp | avg_winddir | avg_windspd | avg_winddir_sin | avg_winddir_cos | max_cumulative_precip | max_snow_density_6 | max_cumulative_snow | max_cumulative_ice |
|---|---|---|---|---|---|---|---|---|---|
| run_datetime | | | | | | | | | |
| 2015-07-15 00:00:00 | 0.000000 | 80.302464 | 3.386380 | -0.981653 | 0.190676 | 2.009 | 0.0 | 0.0 | 0.0 |
| 2015-07-15 01:00:00 | 287.378997 | 76.866373 | 3.326687 | 0.994736 | 0.102466 | 1.209 | 0.0 | 0.0 | 0.0 |
| 2015-07-15 02:00:00 | 287.388845 | 76.258867 | 3.243494 | 0.758262 | 0.651950 | 0.400 | 0.0 | 0.0 | 0.0 |
| 2015-07-15 03:00:00 | 287.427324 | 78.299616 | 3.145505 | 0.237897 | -0.971290 | 0.000 | 0.0 | 0.0 | 0.0 |
| 2015-07-15 04:00:00 | 287.489158 | 84.632852 | 3.047607 | 0.189006 | -0.981976 | 0.000 | 0.0 | 0.0 | 0.0 |
5.3.2.2.1 Temperature and Wind Modeling over Time - Select Model (Multiple Linear Regression)
In the initial phase of the Temperature and Wind Modeling over Time analysis, a Multiple Linear Regression is conducted, as introduced in the lecture. Multiple regression is akin to linear regression but involves more than one independent variable, meaning an attempt is made to predict a value based on two or more variables.
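In symbols, the multiple linear regression model with $p$ predictors takes the form

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```

where the coefficients $\beta_0, \dots, \beta_p$ are estimated by ordinary least squares, i.e. by minimizing the sum of squared residuals $\sum_i (y_i - \hat{y}_i)^2$ over the training data.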
5.3.2.2.2 Temperature and Wind Modeling over Time - Training and Validation (Multiple Linear Regression)
In the following analysis, three linear regression models are trained to predict different meteorological parameters.
In the first model, the target variable is avg_temp, and the predictor variables include avg_windspd and avg_winddir.
For the second model, avg_temp remains the target variable, incorporating all previously evaluated variables.
Lastly, the third model focuses on predicting avg_windspd with the inclusion of all previously assessed variables.
These models aim to capture relationships between temperature, wind speed, wind direction, and other relevant features, providing insights into the complex dynamics of the observed meteorological data.
The models will be evaluated with the following plots:
Scatter Plot of True vs. Predicted Values:
Residual Plot:
Histogram of Residuals:
Multiple Linear Regression (MLR) with target variable "Average Temperature" and predictor variables "Average Windspeed" and "Wind Direction"
# Extracting the target variable and predictor variables
y = results_df['avg_temp']
X = results_df[['avg_windspd','avg_winddir']]
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training (Multiple Linear Regression) on the training split only,
# so that the held-out test set stays unseen during fitting
regr = LinearRegression()
regr.fit(X_train, y_train)
# Predict on the test set (Multiple Linear Regression)
y_pred = regr.predict(X_test)
#Evaluate on Test Set (Multiple Linear Regression)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error:', mae)
r2 = r2_score(y_test, y_pred)
print('R-squared:', r2)
Mean Squared Error: 128.267893719392
Mean Absolute Error: 9.658907432391409
R-squared: 0.015238184007991595
The very low R-squared value indicates that the linear model does not perform well on the given dataset.
Visual evaluations of the regression model will be conducted through a series of plots to assess prediction accuracy and identify any patterns in the residuals that may suggest model adjustments.
# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Scatter Plot
axes[0].scatter(y_test, y_pred)
axes[0].set_xlabel('True Values')
axes[0].set_ylabel('Predictions')
axes[0].set_title('Scatter Plot of True vs. Predicted Values')
# Residual Plot
residuals = y_test - y_pred
axes[1].scatter(y_pred, residuals)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
# Histogram of Residuals
sns.histplot(residuals, kde=True, ax=axes[2])
axes[2].set_xlabel('Residuals')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Histogram of Residuals')
# Add title
fig.suptitle('MLR with target variable "Average Temperature" and predictor variables "Average Windspeed" and "Wind Direction"')
# Adjust layout
plt.tight_layout()
# Show plots
plt.show()
The visualizations collectively indicate how well the Multiple Linear Regression (MLR) model performs in predicting average temperature from wind speed and direction.
Scatter Plot Analysis: The scatter plot reveals a dense cluster of points without a clear linear pattern, indicating that the model does not predict the average temperature accurately across the range of true values. The cloud-like distribution of points, rather than a tight diagonal line, suggests that the model's predictions are not consistently aligned with the true values, leading to a broad spread of prediction errors.
Residual Plot Analysis: The residual plot shows a wide dispersion of residuals across the predicted temperature range, with no discernible pattern. The lack of structure implies that the model errors are somewhat random, but the spread of residuals indicates variability in the model's performance across different temperature ranges. The concentration of points around the zero line for lower predicted values and the spread for higher values suggest the model may be less accurate at extreme temperatures.
Histogram of Residuals Analysis: The histogram of residuals displays a distribution with two peaks, deviating from the expected normal distribution that would typically indicate a well-fitting model. The skew towards negative residuals implies that the model more frequently overestimates the temperature (actual values below the predictions). The presence of two peaks could indicate the model performs differently for two subsets of the data, possibly corresponding to different conditions or time periods within the dataset.
Conclusion: The visual assessments suggest that the model's predictive power is limited. The residuals' behavior indicates potential model inadequacies, possibly due to omitted variables, non-linear relationships, or other model specification issues. The results highlight the need for further model refinement, such as incorporating additional predictive features, exploring non-linear models, or segmenting the data for different conditions. The ultimate goal is to achieve a more homogenous distribution of residuals and a narrower scatter plot distribution around the line of perfect prediction.
Temperature and Wind Modeling over Time - Save Model (Multiple Linear Regression)
# Create a meaningful name with a timestamp
model_name = f"MLR_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name, 'wb') as f:
    pickle.dump(regr, f)
print(f"The MLR model has been saved to '{folder_path + model_name}'.")
The MLR model has been saved to '../models/MLR_model_20231229_1802.pkl'.
Multiple Linear Regression (MLR) with target variable "Average Temperature" and multiple predictor variables
A Multiple Linear Regression model with average temperature as the target variable is constructed using a wider range of predictors. This model seeks to improve upon previous ones by utilizing a more extensive set of features that may influence temperature. The expected outcome is a more accurate prediction, as indicated by improvements in the Mean Squared Error, Mean Absolute Error, and R-squared metrics.
# Extracting the target variable and predictor variables
y = results_df['avg_temp']
X = results_df.drop(columns=['avg_temp'])
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training (Multiple Linear Regression) on the training split only,
# so that the held-out test set stays unseen during fitting
regr = LinearRegression()
regr.fit(X_train, y_train)
# Predict on the test set (Multiple Linear Regression)
y_pred = regr.predict(X_test)
#Evaluate on Test Set (Multiple Linear Regression)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error:', mae)
r2 = r2_score(y_test, y_pred)
print('R-squared:', r2)
Mean Squared Error: 103.25403414868913
Mean Absolute Error: 8.390307032430659
R-squared: 0.20727917775583216
The R-squared value has increased compared to before, but it remains relatively low.
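Plain R-squared never decreases when predictors are added, which is one reason the forward-selection step earlier ranked candidates by adjusted R-squared instead. The adjustment can be sketched as follows; the sample and predictor counts used below are hypothetical illustrations, not taken from the analysis.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R-squared: penalizes plain R-squared for model size."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical illustration: the penalty matters far more on small samples
print(adjusted_r2(0.21, 100, 9))    # noticeably below 0.21
print(adjusted_r2(0.21, 14000, 9))  # barely below 0.21
```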
Additional visualizations will be created to analyze the regression model's performance, including a scatter plot, a residual plot, and a histogram of residuals. These will help identify any remaining discrepancies between predicted and actual values.
# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Scatter Plot
axes[0].scatter(y_test, y_pred)
axes[0].set_xlabel('True Values')
axes[0].set_ylabel('Predictions')
axes[0].set_title('Scatter Plot of True vs. Predicted Values')
# Residual Plot
residuals = y_test - y_pred
axes[1].scatter(y_pred, residuals)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
# Histogram of Residuals
sns.histplot(residuals, kde=True, ax=axes[2])
axes[2].set_xlabel('Residuals')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Histogram of Residuals')
# Add title
fig.suptitle('MLR with target variable "Average Temperature" and multiple predictor variables')
# Adjust layout
plt.tight_layout()
# Show plots
plt.show()
Scatter Plot Analysis: The plot shows that predictions are not well-aligned with the true values; particularly, for true values in the mid-range, predictions are compressed towards the center of the range. This suggests a model bias towards average temperatures and difficulty in accurately predicting more extreme temperatures.
Residual Plot Analysis: Residuals are not randomly dispersed around the zero line, which would indicate accurate predictions. Instead, there's a pronounced pattern where predictions for mid-range temperatures are too high (positive residuals) and for the extremes are too low (negative residuals), particularly at the lower end of the temperature scale.
Histogram of Residuals Analysis: The histogram reveals a bimodal distribution, indicating two different patterns of residuals within the data. Instead of the expected normal distribution, this model's residuals suggest that there may be two distinct subsets of data, or that the model's linear assumptions are not fully capturing the underlying relationships.
Conclusion: These plots confirm that while the model with multiple predictors performs better than the simpler models, it still struggles to accurately predict temperatures at the extremes and tends to predict closer to the mean value, leading to systematic errors. This suggests the need for further model refinement or the exploration of non-linear models to better capture the complexity of the relationships between the variables.
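As the conclusion suggests, exploring non-linear models is a natural next step. One option, not part of the original analysis, is a tree ensemble such as scikit-learn's RandomForestRegressor; the sketch below demonstrates it on synthetic data (not the ERA5 frame) where the target is deliberately non-linear in the features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic non-linear target: a linear model cannot capture sin(x1)*x2,
# but a tree ensemble can approximate it well
rng = np.random.default_rng(42)
X_syn = rng.uniform(-3, 3, size=(1000, 2))
y_syn = np.sin(X_syn[:, 0]) * X_syn[:, 1] + rng.normal(scale=0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_tr, y_tr)
print('R-squared:', r2_score(y_te, rf.predict(X_te)))
```

The same fit/predict/score pattern would apply unchanged to the `results_df` split used above, making this a low-effort comparison against the linear baseline.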
Temperature and Wind Modeling over Time - Save Model (Multiple Linear Regression)
# Create a meaningful name with a timestamp
model_name = f"MLR_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name, 'wb') as f:
pickle.dump(regr, f)
print(f"The MLR model has been saved to '{folder_path + model_name}'.")
The MLR model has been saved to '../models/MLR_model_20231229_1802.pkl'.
Multiple Linear Regression (MLR) with target variable "Average Windspeed" and multiple predictor variables
# Extracting the target variable and predictor variables
y = results_df['avg_windspd']
X = results_df.drop(columns=['avg_windspd'])
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training (Multiple Linear Regression)
regr = LinearRegression()
# Fit on the training split only; fitting on the full data would leak the test set
regr.fit(X_train, y_train)
# Predict on Test Set (Multiple Linear Regression)
y_pred = regr.predict(X_test)
#Evaluate on Test Set (Multiple Linear Regression)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error:', mse)
mae = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error:', mae)
r2 = r2_score(y_test, y_pred)
print('R-squared:', r2)
Mean Squared Error: 0.5038677092424138
Mean Absolute Error: 0.5636674787101482
R-squared: 0.1563048200931636
The R-squared value is extremely low, indicating poor model performance.
# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Scatter Plot
axes[0].scatter(y_test, y_pred)
axes[0].set_xlabel('True Values')
axes[0].set_ylabel('Predictions')
axes[0].set_title('Scatter Plot of True vs. Predicted Values')
# Residual Plot
residuals = y_test - y_pred
axes[1].scatter(y_pred, residuals)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('Predicted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residual Plot')
# Histogram of Residuals
sns.histplot(residuals, kde=True, ax=axes[2])
axes[2].set_xlabel('Residuals')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Histogram of Residuals')
# Add title
fig.suptitle('Multiple Linear Regression (MLR) with target variable "Average Windspeed" and multiple predictor variables')
# Adjust layout
plt.tight_layout()
# Show plots
plt.show()
Scatter Plot Analysis: The scatter plot shows a lack of a definitive pattern or linear relationship, with data points widely spread out. Ideally, the points would lie along a diagonal line if the predictions closely matched the true values. The broad spread indicates that the model's predictions deviate significantly from the actual wind speed values, particularly for lower and higher true values.
Residual Plot Analysis: The residual plot does not show the random scatter expected of a well-performing model. Instead, the spread of residuals indicates systematic errors in predictions across all ranges. The absence of a clear trend or clustering around the zero line suggests the model is not accurately capturing the relationship between wind speed and the predictors.
Histogram of Residuals Analysis: The histogram reveals a right-skewed distribution, with a tail extending towards the positive side, suggesting that the model tends to underpredict the wind speed. The peak of the histogram near zero indicates that for many predictions the model's errors are small, but the long tail shows that there are also many instances where the predictions are far off.
Conclusion: These visualizations indicate that the model's ability to predict "Average Windspeed" is limited. The skewness of the residual histogram and the wide spread in the scatter plot suggest that the current predictors do not capture all the factors influencing wind speed, or that the relationship between wind speed and the predictors is non-linear or influenced by variables not included in the model. Further model refinement or different modeling techniques, such as non-linear models or the inclusion of interaction terms, might improve predictive performance.
Temperature and Wind Modeling over Time - Save Model (Multiple Linear Regression)
# Create a meaningful name with a timestamp
model_name = f"MLR_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name, 'wb') as f:
pickle.dump(regr, f)
print(f"The MLR model has been saved to '{folder_path + model_name}'.")
The MLR model has been saved to '../models/MLR_model_20231229_1802.pkl'.
Conclusion:
The charts and metrics indicate that all models perform poorly. In general, it can be concluded that neither temperature nor wind speed can be well predicted based on the given data. The use of all evaluated features (from the feature selection process) represents an improvement compared to exclusively using the variables "Windspeed" and "Winddirection".
This could be due to the fact that weather data often exhibits non-linear relationships that cannot be captured by a linear regression. Additionally, the presence of outliers and anomalies may distort the results. Given that weather events and temperature fluctuations frequently have highly complex underlying factors, these interactions might be too intricate to be adequately addressed by a Multiple Linear Regression. There is also the possibility of missing relevant variables that could enhance the prediction. It's essential to consider more sophisticated modeling approaches, such as non-linear regression or machine learning algorithms, to better capture the complexities inherent in weather data. Additionally, thorough data analysis, outlier handling, and consideration of domain-specific knowledge are crucial for improving the performance of the predictive models.
Consequently, the multiple linear regression model is challenging to apply to the given data without further adjustments.
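As one illustration of the non-linear alternatives suggested above, the sketch below fits scikit-learn's RandomForestRegressor. Note that this is a minimal, hypothetical example on synthetic data whose column names (avg_temp, avg_windspd, avg_winddir) merely mirror this project's schema; it is not the project's actual data or a final model choice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for results_df: a deliberately non-linear temperature/wind relationship
rng = np.random.default_rng(42)
n = 2000
wind = rng.uniform(0, 15, n)
winddir = rng.uniform(0, 360, n)
temp = 280 + 5 * np.sin(np.deg2rad(winddir)) - 0.1 * wind**2 + rng.normal(0, 0.5, n)
df = pd.DataFrame({"avg_windspd": wind, "avg_winddir": winddir, "avg_temp": temp})

X = df[["avg_windspd", "avg_winddir"]]
y = df["avg_temp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training split only (no leakage into the test set)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
r2 = r2_score(y_test, rf.predict(X_test))
print(f"Random forest R-squared: {r2:.3f}")
```

On a non-linear relationship such as this synthetic one, a random forest typically recovers a far higher R-squared than a linear fit, which is the motivation for trying such models on the weather data.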
Having modeled the temperature parameter with a multiple-predictor linear regression, it should now be determined how the temperature parameter can be forecasted using statistical, big data and machine learning techniques. For this analysis, training data from up until 2021 is used and data from the year 2022 is predicted, thus simulating a temperature forecast. Different statistical time series and machine learning techniques are used for this approach. The result is not supposed to be a very exact prediction of the temperature for the year 2022, or even for single weeks or days within it, but rather a basic and general development and progression curve of the weather parameter for the forecasted test data.
5.3.3.1.1 Temperature and Wind Modeling over Time - Select Model (SARIMAX)
The ARIMA (Auto-Regressive Integrated Moving Average) approach is one of the most cost-effective and reliable time series techniques. Since the exogenous wind variables should be included and seasonality is to be considered, the extended SARIMAX model is utilized here 1. SARIMAX (Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors) models are among the most widely used statistical models for forecasting, with excellent forecasting performance 2.
By looking at the visualization of the temperature from above (chapter 5.3.2.2.3) some seasonality in data can be noticed. Besides the seasonality, it is also relevant to analyse the underlying trend. The examination of trends enables the identification of long-term patterns, aiding in the understanding of climate change and gradual shifts in meteorological conditions. Seasonality, on the other hand, unveils recurring patterns that follow a consistent cycle, offering insights into predictable variations such as temperature fluctuations and precipitation changes throughout the year. Meanwhile, scrutinizing residuals reveals irregularities and unexplained components in the data, allowing for the detection of anomalies or sudden deviations from expected weather patterns. This holistic exploration of trend, seasonality, and residuals not only enhances the ability to make accurate weather predictions but also empowers meteorologists and climate scientists to comprehend the broader dynamics of the atmosphere, contributing to a more comprehensive and nuanced understanding of weather phenomena.
In the next step, a monthly time series decomposition into trend, seasonality and residuals will be performed:
Elshewey, Ahmed M./Mahmoud Y. Shams/Abdelghafar M. Elhady/Samaa M. Shohieb/Abdelaziz A. Abdelhamid/Abdelhameed Ibrahim/Zahraa Tarek (2022): A novel WD-SARIMAX model for temperature forecasting using Daily Delhi Climate Dataset, in: Sustainability, vol. 15, no. 1, p. 757, [online] doi:10.3390/su15010757.↩
Amat Rodrigo, Joaquín/Javier Escobar Ortiz (n.d.): Forecasting SARIMAX and ARIMA models - Skforecast Docs, [online] https://joaquinamatrodrigo.github.io/skforecast/0.7.0/user_guides/forecasting-sarimax-arima.html#.↩
Nguyen, Bruce (2021): End-to-End Time Series Analysis and Forecasting: a Trio of SARIMAX, LSTM and Prophet (Part 1), in: Medium, 29.12.2021, [online] https://towardsdatascience.com/end-to-end-time-series-analysis-and-forecasting-a-trio-of-sarimax-lstm-and-prophet-part-1-306367e57db8.↩
Create df for sarimax prediction:
# Extract relevant columns from the DataFrame (copy to avoid a SettingWithCopyWarning when modifying below)
df_temp_wind = df_full[["run_datetime", "avg_temp", "avg_windspd", "avg_winddir", "avg_windgust"]].copy()
# Set the 'run_datetime' column as the DataFrame index and convert it to datetime format
df_temp_wind.index = pd.to_datetime(df_temp_wind['run_datetime'])
# Drop the original 'run_datetime' column as it's now set as the index
df_temp_wind.drop(columns='run_datetime', inplace=True)
# Remove rows with duplicated indices, keeping only the first occurrence
df_temp_wind = df_temp_wind.loc[~df_temp_wind.index.duplicated(keep='first')]
# Display the first few rows of the resulting DataFrame
df_temp_wind.head()
| run_datetime | avg_temp | avg_windspd | avg_winddir | avg_windgust |
|---|---|---|---|---|
| 2015-07-15 00:00:00 | 287.389224 | 3.386380 | 80.302464 | 11.136197 |
| 2015-07-15 01:00:00 | 287.378997 | 3.326687 | 76.866373 | 11.002795 |
| 2015-07-15 02:00:00 | 287.388845 | 3.243494 | 76.258867 | 10.700595 |
| 2015-07-15 03:00:00 | 287.427324 | 3.145505 | 78.299616 | 10.323983 |
| 2015-07-15 04:00:00 | 287.489158 | 3.047607 | 84.632852 | 9.921157 |
The time series data will be decomposed into trend, seasonality, and residuals, allowing for an in-depth analysis of underlying patterns. This step is crucial for enhancing the predictive accuracy of the SARIMAX model in temperature and wind forecasting. By examining these components individually, a more precise understanding of weather dynamics can be achieved, leading to improved model performance and insight into climatic behaviors.
from statsmodels.tsa.seasonal import seasonal_decompose
def decompose_and_plot(data, variable_name, position):
# Decompose the time series data
decompose_data = seasonal_decompose(data.asfreq('M'), model="additive", extrapolate_trend='freq')
# Plot the original data
plt.subplot(4, 4, position)
data.plot()
plt.title(f'Original Data of {variable_name}')
# Plot the trend component
plt.subplot(4, 4, position + 1)
decompose_data.trend.plot()
plt.title(f'Trend of {variable_name}')
# Plot the seasonal component
plt.subplot(4, 4, position + 2)
decompose_data.seasonal.plot()
plt.title(f'Seasonal of {variable_name}')
# Plot the residual component
plt.subplot(4, 4, position + 3)
decompose_data.resid.plot()
plt.title(f'Residual of {variable_name}')
# Create a figure with a size of 16x12 inches
plt.figure(figsize=(16, 12))
# Decompose and plot for avg_temp
decompose_and_plot(df_temp_wind["avg_temp"], "avg_temp", 1)
# Decompose and plot for avg_winddir
decompose_and_plot(df_temp_wind["avg_winddir"], "avg_winddir", 5)
# Decompose and plot for avg_windspd
decompose_and_plot(df_temp_wind["avg_windspd"], "avg_windspd", 9)
# Decompose and plot for avg_windgust
decompose_and_plot(df_temp_wind["avg_windgust"], "avg_windgust", 13)
# Adjust layout to prevent overlap
plt.tight_layout()
# Save the figure before showing it (calling savefig after show would write an empty figure)
plt.savefig('../output/Linear Regression Analysis Multiple Predictors Seasonality and Trend.png')
# Display the plots
plt.show()
The provided time series plots offer a comprehensive view of the data's characteristics by breaking down each variable into its original, trend, seasonal, and residual components.
Average Temperature (avg_temp):
Original Data: Displays a clear, repetitive oscillation, indicative of strong seasonality in temperature data. Trend: There seems to be slight variability, with some periods showing a gentle rise or fall, but due to the nature of temperature data, which is highly seasonal, this doesn't necessarily indicate a long-term trend. Seasonality: Shows consistent and expected cyclical patterns corresponding to the seasons, with peaks typically in the warmer months and troughs in the cooler months. Residuals: Fluctuations around the zero line without any discernible pattern suggest that most of the systematic information has been captured by the trend and seasonal components.
Average Wind Direction (avg_winddir):
Original Data: Appears relatively constant over time, although volatility is considerable. Trend: No discernible long-term trend is visible; the data fluctuates around a mean without a clear direction. Seasonality: The seasonal plot does not show a clear pattern, which is typical for wind direction as it can be influenced by various unpredictable factors. Residuals: The residuals are large in magnitude compared to the other variables, reflecting the high variability in wind direction that is not captured by the seasonal or trend components.
Average Wind Speed (avg_windspd):
Original Data: Exhibits considerable variance over time with what appears to be seasonal peaks and troughs. Trend: A very subtle decrease in wind speed is observed, but given the short time frame and inherent variability of wind speed data, it is hard to confirm this as a significant trend. Seasonality: Clear seasonal patterns are evident, likely reflecting the influence of larger atmospheric conditions that affect wind speed. Residuals: The residuals remain after removing the trend and seasonal effects, showing random variability which is typical for wind speed data.
Average Wind Gust (avg_windgust):
Original Data: Similar to average wind speed, it shows variability that suggests a seasonal influence. Trend: There is a suggestion of a decreasing trend, but as with wind speed, the inherent variability and short time frame make it difficult to ascertain a definitive trend. Seasonality: Displays a seasonal cycle with peaks and troughs, indicating that wind gusts are stronger in some seasons than others. Residuals: The residuals indicate the random fluctuations that are left after accounting for the systematic components of the data.
Conclusion: The decomposition of each variable into trend, seasonality, and residuals has revealed that while there is a strong seasonal component across all variables, the trends are not as clear due to the short timespan and natural variability in the data. The residual plots do not exhibit any additional patterns, suggesting that the decomposed components capture the primary behaviors of the dataset. This analysis confirms the cyclical nature of weather-related data and underlines the importance of seasonal adjustment in forecasting models like SARIMAX. The lack of pattern in the residuals is a positive indicator that the model has adequately captured the data's structure.
Augmented Dickey-Fuller Test:
Hereinafter the Augmented Dickey-Fuller test will be performed to determine whether the given time series is stationary or not. The ADF test tests the null hypothesis that α=1, i.e. that the series contains a unit root. When the test statistic is lower than the critical value shown, the null hypothesis is rejected and the time series is inferred to be stationary 1. A stationary time series is one whose mean, variance and autocorrelation structure do not change over time; in other words, it has no cycle, trend or seasonality.
Prabhakaran, Selva (2022): Augmented Dickey Fuller Test (ADF Test) – must read guide, Machine Learning Plus, [online] https://www.machinelearningplus.com/time-series/augmented-dickey-fuller-test/.↩
from statsmodels.tsa.stattools import adfuller
def adf_test(variable, name):
# Perform the Augmented Dickey-Fuller test
dftest = adfuller(variable, autolag='AIC')
# Print the results
print(f"{name}: ")
print(f"1. ADF : {dftest[0]}")
print(f"2. P-Value : {dftest[1]}")
print(f"3. Num Of Lags : {dftest[2]}")
print(f"4. Num Of Observations Used For ADF Regression and Critical Values Calculation : {dftest[3]}")
print("5. Critical Values :")
for key, val in dftest[4].items():
print(f"\t{key} : {val}")
print("\n")
# Perform ADF test for each variable
adf_test(df_temp_wind["avg_temp"], "Average Temperature")
adf_test(df_temp_wind["avg_winddir"], "Average Wind Direction")
adf_test(df_temp_wind["avg_windspd"], "Average Wind Speed")
adf_test(df_temp_wind["avg_windgust"], "Average Wind Gust")
Average Temperature:
1. ADF : -5.6633939102963415
2. P-Value : 9.268984051353608e-07
3. Num Of Lags : 61
4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65258
5. Critical Values :
	1% : -3.4304502108131385
	5% : -2.8615842913495717
	10% : -2.566793574780786

Average Wind Direction:
1. ADF : -26.935153074526742
2. P-Value : 0.0
3. Num Of Lags : 54
4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65265
5. Critical Values :
	1% : -3.4304502000646013
	5% : -2.861584286598995
	10% : -2.5667935722522013

Average Wind Speed:
1. ADF : -26.388919718643297
2. P-Value : 0.0
3. Num Of Lags : 53
4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65266
5. Critical Values :
	1% : -3.4304501985292846
	5% : -2.861584285920424
	10% : -2.566793571891019

Average Wind Gust:
1. ADF : -24.04907190849269
2. P-Value : 0.0
3. Num Of Lags : 61
4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 65258
5. Critical Values :
	1% : -3.4304502108131385
	5% : -2.8615842913495717
	10% : -2.566793574780786
It can be seen that the ADF test statistic for average temperature is -5.6634 with a p-value of 9.27e-07, indicating a rejection of the null hypothesis of a unit root. The negative ADF value, lower than the critical values at the 1%, 5%, and 10% significance levels, supports the conclusion of stationarity in the average temperature time series.
For the other variables with a p-value of 0, which is much less than the significance level of 0.05, the null hypothesis can be rejected, and the series can be considered stationary.
Since the ADF tests have indicated stationarity for all variables, modelling can proceed.
5.3.3.1.2 Temperature and Wind Modeling over Time - Training and Validation
In the following, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are used for SARIMAX model selection and evaluation.
These criteria provide quantitative measures of the trade-off between model goodness-of-fit and complexity, aiding in the identification of the most suitable model for a given time series. A lower AIC or BIC indicates a better-fitting model, considering both the accuracy in capturing the observed data and the penalization for model complexity. By systematically exploring various combinations of model parameters and assessing their AIC and BIC values, practitioners can effectively navigate the model space, improving the likelihood of selecting a SARIMAX model that strikes an optimal balance between explanatory power and simplicity, leading to more reliable and generalizable predictions 1.
Zach (2021): How to calculate AIC of regression models in Python, Statology, [online] https://www.statology.org/aic-in-python/.↩
Akaike Information Criterion (AIC):
import itertools
# Define response variable
y = df_temp_wind['avg_temp']
# Define predictor variables
predictor_vars = ['avg_windspd', 'avg_winddir', 'avg_windgust']
# Initialize variables to store best AIC and best predictor combination
best_aic = float('inf')
best_predictors = None
# Iterate through all combinations of predictor variables
for r in range(1, len(predictor_vars) + 1):
for combo in itertools.combinations(predictor_vars, r):
# Select the current combination of predictors
current_predictors = list(combo)
# Define the predictors and add a constant
x = df_temp_wind[current_predictors]
x = sm.add_constant(x)
# Fit the regression model
model = sm.OLS(y, x).fit()
# Update best predictors if the current AIC is lower
if model.aic < best_aic:
best_aic = model.aic
best_predictors = current_predictors
# Print the best combination of predictors and its AIC
print(f"Best Predictors: {best_predictors}")
print(f"Best AIC: {best_aic}")
Best Predictors: ['avg_windspd', 'avg_winddir']
Best AIC: 502202.826219331
Bayesian Information Criterion (BIC):
# Define response variable
y = df_temp_wind['avg_temp']
# Define predictor variables
predictor_vars = ['avg_windspd', 'avg_winddir','avg_windgust']
# Initialize variables to store best BIC and best predictor combination
best_bic = float('inf')
best_predictors_bic = None
# Iterate through all combinations of predictor variables
for r in range(1, len(predictor_vars) + 1):
for combo in itertools.combinations(predictor_vars, r):
# Select the current combination of predictors
current_predictors = list(combo)
# Define the predictors and add a constant
x = df_temp_wind[current_predictors]
x = sm.add_constant(x)
# Fit the regression model
model = sm.OLS(y, x).fit()
# Update best predictors if the current BIC is lower
if model.bic < best_bic:
best_bic = model.bic
best_predictors_bic = current_predictors
# Print the best combination of predictors and its BIC
print(f"Best Predictors (BIC): {best_predictors_bic}")
print(f"Best BIC: {best_bic}")
Best Predictors (BIC): ['avg_windspd', 'avg_winddir']
Best BIC: 502230.0873799723
Both AIC and BIC show that the best predictor variables are 'avg_windspd' and 'avg_winddir'.
Auto ARIMA
AutoARIMA is an automated time series forecasting algorithm based on the ARIMA (AutoRegressive Integrated Moving Average) model. The algorithm automates the process of selecting the optimal model parameters (p, d, q) by searching through different combinations and choosing the model with the best fit based on a specified criterion, such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). It integrates both non-seasonal and seasonal components to capture patterns and trends in time series data effectively.
For the given dataset, which consists of hourly temperature readings, the choice of AutoARIMA parameters is guided by the characteristics of the time series. The m=24 parameter reflects the potential daily seasonality of hourly data, assuming a repeating pattern every 24 hours; note, however, that the search below is run with seasonal=False, so no seasonal terms are actually fitted, which the selected model (3,1,3)(0,0,0)[0] confirms. Starting with start_p=1 and start_q=1 provides a sensible initial exploration of autoregressive and moving average components. The max_p=3 and max_q=3 values limit the search space for these components, preventing excessive model complexity. Setting D=1 would apply seasonal differencing if the seasonal search were enabled, and trace=True allows for monitoring of the algorithm's progress. These choices aim to strike a balance between capturing the underlying patterns in the data and preventing overfitting.
from pmdarima import auto_arima
auto_arima_results= auto_arima(df_temp_wind["avg_temp"],
start_p=1,
start_q=1,
max_p=3,
max_q=3, # maximum p and q
m=24, # frequency of series (only relevant when seasonal=True)
d=None, # let model determine 'd'
seasonal=False, # no seasonal terms are searched
start_P=0,
D=1, # seasonal differencing; ignored since seasonal=False
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
auto_arima_results.summary()
Performing stepwise search to minimize aic
ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=-300996.225, Time=7.82 sec
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=-30853.412, Time=2.62 sec
ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=-272301.363, Time=3.90 sec
ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=-116812.636, Time=10.22 sec
ARIMA(0,1,0)(0,0,0)[0]           : AIC=-30855.181, Time=1.35 sec
ARIMA(2,1,1)(0,0,0)[0] intercept : AIC=-315512.530, Time=3.67 sec
ARIMA(2,1,0)(0,0,0)[0] intercept : AIC=-315504.206, Time=4.01 sec
ARIMA(3,1,1)(0,0,0)[0] intercept : AIC=-315506.573, Time=3.23 sec
ARIMA(2,1,2)(0,0,0)[0] intercept : AIC=-315513.680, Time=8.53 sec
ARIMA(1,1,2)(0,0,0)[0] intercept : AIC=-309847.293, Time=22.26 sec
ARIMA(3,1,2)(0,0,0)[0] intercept : AIC=-309557.825, Time=40.41 sec
ARIMA(2,1,3)(0,0,0)[0] intercept : AIC=-315534.509, Time=12.44 sec
ARIMA(1,1,3)(0,0,0)[0] intercept : AIC=-312901.486, Time=18.11 sec
ARIMA(3,1,3)(0,0,0)[0] intercept : AIC=-315657.922, Time=10.30 sec
ARIMA(3,1,3)(0,0,0)[0]           : AIC=-315828.456, Time=23.75 sec
ARIMA(2,1,3)(0,0,0)[0]           : AIC=-315536.495, Time=5.94 sec
ARIMA(3,1,2)(0,0,0)[0]           : AIC=inf, Time=30.87 sec
ARIMA(2,1,2)(0,0,0)[0]           : AIC=-315515.665, Time=4.10 sec
Best model: ARIMA(3,1,3)(0,0,0)[0]
Total fit time: 213.543 seconds
| Dep. Variable: | y | No. Observations: | 65320 |
|---|---|---|---|
| Model: | SARIMAX(3, 1, 3) | Log Likelihood | 157921.228 |
| Date: | Fri, 29 Dec 2023 | AIC | -315828.456 |
| Time: | 18:07:02 | BIC | -315764.847 |
| Sample: | 0 - 65320 | HQIC | -315808.775 |
| Covariance Type: | opg | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | 2.4725 | 0.021 | 116.934 | 0.000 | 2.431 | 2.514 |
| ar.L2 | -2.0030 | 0.038 | -52.033 | 0.000 | -2.078 | -1.928 |
| ar.L3 | 0.5260 | 0.018 | 29.575 | 0.000 | 0.491 | 0.561 |
| ma.L1 | -0.8029 | 0.021 | -37.572 | 0.000 | -0.845 | -0.761 |
| ma.L2 | -0.0242 | 0.005 | -5.296 | 0.000 | -0.033 | -0.015 |
| ma.L3 | -0.0307 | 0.006 | -5.555 | 0.000 | -0.042 | -0.020 |
| sigma2 | 0.0005 | 1.2e-06 | 384.413 | 0.000 | 0.000 | 0.000 |
| Ljung-Box (L1) (Q): | 4.99 | Jarque-Bera (JB): | 230383.67 |
|---|---|---|---|
| Prob(Q): | 0.03 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 1.00 | Skew: | -0.09 |
| Prob(H) (two-sided): | 0.91 | Kurtosis: | 12.20 |
The analysis using the AutoARIMA model has provided insights into the time series data for average temperature (avg_temp). The selected model parameters for the SARIMAX model are (3,1,3)(0,0,0)[0], where:
p (order of autoregressive term): 3
d (degree of differencing): 1
q (order of moving average term): 3
P (seasonal order of autoregressive term): 0
D (seasonal degree of differencing): 0
Q (seasonal order of moving average term): 0
m (seasonal frequency): 0 (no seasonal component was fitted, since the search was run with seasonal=False)

Here is a summary of the key findings:
Model Fit: The SARIMAX model with these parameters achieves an impressive Log Likelihood of 157,921.228, indicating a strong fit to the data.
Information Criteria: The Akaike Information Criterion (AIC) is -315,828.456, and the Bayesian Information Criterion (BIC) is -315,764.847. These values are significant improvements over alternative models, suggesting that this SARIMAX model is the best choice for forecasting temperature based on the AIC and BIC criteria.
Model Coefficients: The model coefficients indicate significant autoregressive (AR) and moving average (MA) terms. The coefficients are within acceptable ranges, indicating stability and significance.
Residuals: The Ljung-Box test at lag 1 gives Prob(Q) = 0.03, indicating that some autocorrelation remains in the residuals at the 5% significance level, and the Jarque-Bera test (Prob(JB) = 0.00, kurtosis 12.20) rejects normality, pointing to heavy-tailed residuals. The model therefore captures most, but not all, of the structure in the data.
Heteroskedasticity and Skew: The heteroskedasticity statistic is H = 1.00 with Prob(H) = 0.91, so there is no evidence of heteroskedasticity, and the skewness of -0.09 is close to zero.
Conclusion: The SARIMAX model with the parameters (3,1,3)(0,0,0)[0] provides a strong fit to the average temperature series, reflected in the high Log Likelihood and the low AIC and BIC values, both of which support its selection for temperature forecasting. The residual diagnostics do, however, show some remaining lag-1 autocorrelation and heavy tails, so the fit should not be read as capturing every dependency in the data.
This model can be used to produce temperature forecasts and to gain insights into temperature variations, which is valuable for applications such as weather forecasting and climate analysis.
The next step is the diagnostic phase, which visualizes the model's diagnostics. Visualizing the diagnostic results allows us to assess how well the SARIMAX model adapts to the temperature data, ensuring the reliability of its predictions. It can help evaluate its ability to capture data patterns and identify any discrepancies between observed and predicted values. Furthermore, visualization helps verify whether the model's assumptions are met, such as the normal distribution of residuals and the absence of autocorrelation. By validating these assumptions, we ensure the model's robustness and reliability.
auto_arima_results.plot_diagnostics(figsize=(10,8))
plt.show()
Standardized Residuals:
The top-left plot displays standardized residuals over time, representing the differences between observed values and the model's predictions, standardized by the standard deviation of the residuals. A model that fits well would show residuals scattered randomly around zero without discernible patterns. In the provided plot, no obvious trends or seasonality are observed in the residuals, suggesting that the model has adequately captured the underlying structure of the data.
Histogram plus Estimated Density:
The top-right plot features a histogram of the residuals with a Kernel Density Estimate (KDE) and a Normal distribution curve superimposed. The histogram illustrates the frequency distribution of the residuals, while the KDE provides a smooth estimate of this distribution. The closer the KDE and the histogram align with the normal curve, the more the residuals approximate a normal distribution. The plot indicates a close fit between the KDE and the Normal distribution curve, suggesting that the residuals are nearly normally distributed, which meets an assumption of the ARIMA model.
Normal Q-Q Plot:
The bottom-left plot, the Quantile-Quantile plot, compares the quantiles of the residuals to the quantiles of a Normal distribution. Points on the Q-Q plot should lie approximately on the 45-degree reference line if the residuals are normally distributed. Some deviation from the line at the extremes is evident in the Q-Q plot, indicating that the extreme values of the residuals may deviate from a perfect normal distribution. However, the majority of points do closely follow the reference line.
Correlogram:
The bottom-right plot, the correlogram, depicts the autocorrelation of the residuals at various lags. For a well-fitted model, the autocorrelations for all lag values are expected to lie within the confidence interval, denoting an absence of significant autocorrelation in the residuals. The correlogram shows the autocorrelations essentially within the expected confidence bounds, implying that the residuals are largely free from autocorrelation and that the model has captured the main time series dependencies.

Taken together, these diagnostic plots suggest that the SARIMAX model with the specified parameters fits the average temperature data well. The residuals exhibit the behavior expected of a well-specified model, with no evident patterns, a distribution approaching normality, and little autocorrelation. This supports the use of the model for forecasting average temperature.
5.3.3.1.3 Temperature and Wind Modeling over Time - Fit Model (SARIMAX)
In the next step, a SARIMAX model is fitted to the temperature and wind data over time. This model is chosen because it can incorporate external factors like wind speed and direction, which influence temperature. Additionally, SARIMAX can handle seasonal patterns often found in weather data. The training and test datasets are defined, exogenous variables (wind speed and direction) are specified, hyperparameters are set, and the SARIMAX model is fitted. Key information about the model's performance and parameters is provided in the model summary. This step is crucial for accurate temperature forecasting and understanding the temperature-wind relationship.
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Manually define train and test data
train_df = df_temp_wind[df_temp_wind.index <= datetime(2022,12,15)]
test_df = df_temp_wind[df_temp_wind.index > datetime(2022,12,15)]
# Define exogenous variables
exog_vars = ['avg_windspd', 'avg_winddir']
# Set the frequency to hourly
train_df = train_df.asfreq('1H')
# Drop rows with missing exogenous values
# (note: this reintroduces gaps into the hourly index, which is why
# statsmodels later warns that the date index has no usable frequency)
train_df = train_df.dropna(subset=exog_vars)
# Set hyperparameters
p, d, q = 3, 1, 3
P, D, Q = 0, 0, 0
s = 24
# Fit SARIMAX
sarimax_model = SARIMAX(train_df['avg_temp'],
                        order=(p, d, q),
                        seasonal_order=(P, D, Q, s),
                        exog=train_df[exog_vars])
sarimax_model_fit = sarimax_model.fit()
# Summary
print(sarimax_model_fit.summary())
c:\Users\177119724\AppData\Local\Python\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
SARIMAX Results
==============================================================================
Dep. Variable: avg_temp No. Observations: 65024
Model: SARIMAX(3, 1, 3) Log Likelihood 157588.046
Date: Fri, 29 Dec 2023 AIC -315158.091
Time: 18:08:09 BIC -315076.349
Sample: 0 HQIC -315132.795
- 65024
Covariance Type: opg
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
avg_windspd 0.1454 0.004 36.921 0.000 0.138 0.153
avg_winddir -3.687e-05 9.41e-06 -3.918 0.000 -5.53e-05 -1.84e-05
ar.L1 2.4614 0.013 196.384 0.000 2.437 2.486
ar.L2 -1.9693 0.023 -84.233 0.000 -2.015 -1.924
ar.L3 0.5050 0.011 45.665 0.000 0.483 0.527
ma.L1 -0.8106 0.013 -64.192 0.000 -0.835 -0.786
ma.L2 -0.0463 0.004 -10.490 0.000 -0.055 -0.038
ma.L3 -0.0673 0.006 -12.051 0.000 -0.078 -0.056
sigma2 0.0005 1.27e-06 360.363 0.000 0.000 0.000
===================================================================================
Ljung-Box (L1) (Q): 6.34 Jarque-Bera (JB): 184518.11
Prob(Q): 0.01 Prob(JB): 0.00
Heteroskedasticity (H): 1.01 Skew: -0.06
Prob(H) (two-sided): 0.58 Kurtosis: 11.25
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The model achieved a log likelihood of 157,588.046, indicating a strong in-sample fit. The AIC (-315,158.091) and BIC (-315,076.349) values support the selection of this SARIMAX specification over the alternative candidates considered for temperature forecasting.
5.3.3.1.4 Temperature and Wind Modeling over Time - Evaluation on test set (SARIMAX)
Next, the SARIMAX model's performance will be evaluated on the test set. This evaluation is crucial to assess how well the model can make accurate temperature forecasts on unseen data.
Generate forecasts for the test period using the SARIMAX model, matching the length of the test dataset.
Plot both the actual average temperature values and the forecasted values to visually compare their trends. This evaluation will provide insights into the model's ability to predict temperature based on wind-related exogenous variables and its real-world forecasting performance.
# Get the forecast for the test period
forecast_steps = len(test_df)
forecast_index = test_df.index # Use the time index of the test dataframe
forecast = sarimax_model_fit.get_forecast(steps=forecast_steps, exog=test_df[exog_vars])
# Plot the actual values
plt.plot(test_df['avg_temp'], label='Actual')
# Plot the forecasted values
plt.plot(forecast_index, forecast.predicted_mean, label='Forecast', color='red')
# Plot confidence intervals
#plt.fill_between(forecast_index, forecast.conf_int()['lower avg_temp'], forecast.conf_int()['upper avg_temp'], color='pink', alpha=0.3)
# Set labels and title
plt.xlabel('Time')
plt.ylabel('Average Temperature')
plt.title('SARIMAX Forecast on Test Data')
plt.legend()
# Save before showing: plt.show() clears the current figure, so calling
# savefig afterwards would write an empty image
plt.savefig('../output/Linear Regression Analysis Multiple Predictors SARIMAX Forecast Results.png')
# Show the plot
plt.show()
c:\Users\177119724\AppData\Local\Python\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
The plot presents a comparison between actual and forecasted average temperature values, providing a visual representation of the SARIMAX model's forecasting performance.
The actual average temperatures, shown in blue, display significant short-term fluctuations characterized by sharp peaks and troughs, indicative of the complex variability inherent in the temperature data over the test period. The forecasted temperatures, depicted in red, show the model's attempt to predict these temperatures. The forecast line is noticeably smoother and fails to match the magnitude and frequency of the observed temperature variations, particularly missing the sharpness of the peaks.
At the start, the model approximates the initial peak in temperature but does not fully capture its height or the subsequent drop, revealing its limitations in predicting sudden changes. As the plot progresses, the forecasted values consistently overestimate the actual temperatures, with the forecast line situated above the actual temperature line most of the time. This overestimation indicates that the model tends to predict higher temperatures than actually observed, suggesting a systematic bias.
The significant divergence between the forecasted and actual values, especially in the latter half of the plot where the model fails to predict the sharp rise and fall in temperatures, underscores the model's struggle to accurately capture the dynamic relationship between temperature and wind-related exogenous variables.
The absence of confidence interval bands in the plot means the uncertainty in the forecasts is not visualized, which could have provided additional insights into the model's predictive performance.
The graphic analysis reveals that the SARIMAX model struggles to accurately forecast the average temperature based on wind. While the initial peak is reasonably predicted, subsequent fluctuations in the data are not adequately captured. The forecast consistently exhibits a tendency to overestimate temperatures, with an average deviation of approximately 7 degrees higher than the observed values. This discrepancy suggests a systematic bias in the prediction, indicating that the SARIMAX model may not effectively account for certain influential factors related to wind-induced temperature variations. Consequently, the overall performance of the SARIMAX prediction falls short, highlighting the need for further refinement or exploration of alternative modeling approaches to enhance accuracy.
Exploring the dataset more comprehensively and fine-tuning the model based on a thorough understanding of the data patterns may contribute to enhancing the SARIMAX model's performance. This involves adjusting hyperparameters and making refinements to better align with the specific characteristics of the data, ultimately leading to more accurate predictions.
The next step involves quantitatively evaluating the SARIMAX model through key statistical metrics to measure accuracy and generalization capabilities. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) will be reviewed to assess model fit and complexity, while the Mean Squared Error (MSE) and Sum Squared Error (SSE) will quantify prediction error. These metrics are crucial for determining the model's forecasting efficacy and pinpointing areas that may require refinement.
# Get evaluation data
sarimax_aic = sarimax_model_fit.aic
sarimax_bic = sarimax_model_fit.bic
sarimax_mean_squared_error = sarimax_model_fit.mse
sarimax_sum_squared_error=sarimax_model_fit.sse
print(f'Akaike information criterion | AIC: {sarimax_aic}')
print(f'Bayesian information criterion | BIC: {sarimax_bic}')
print(f'Mean Squared Error | MSE: {sarimax_mean_squared_error}')
print(f'Sum Squared Error | SSE: {sarimax_sum_squared_error}')
Akaike information criterion | AIC: -315158.0914554971
Bayesian information criterion | BIC: -315076.3489885051
Mean Squared Error | MSE: 1.2663238330051534
Sum Squared Error | SSE: 82341.4409173271
The evaluation data indicate a good model fit: the AIC of -315158.091 and BIC of -315076.349 penalize complexity while rewarding fit, and their low values support the chosen specification relative to the alternatives considered. Note, however, that the MSE of 1.266 and SSE of 82341.441 reported by statsmodels are computed from the in-sample one-step-ahead residuals, not from the multi-step test forecast; they therefore understate the out-of-sample discrepancies between forecasted and actual values seen above, which may limit the model's real predictive accuracy.
5.3.3.1.5 Temperature and Wind Modeling over Time - Save Model (SARIMAX)
# Create a meaningful name with a timestamp
model_name = f"sarimax_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name, 'wb') as f:
    pickle.dump(sarimax_model_fit, f)
print(f"The SARIMAX model has been saved to '{folder_path + model_name}'.")
The SARIMAX model has been saved to '../models/sarimax_model_20231229_1808.pkl'.
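The saved pickle can be reloaded in a later session to forecast without refitting. A minimal round-trip sketch (the dictionary stands in for the fitted results object, and a temporary directory replaces the notebook's `../models/` path):

```python
import os
import pickle
import tempfile
from datetime import datetime

# Stand-in object; in the notebook this would be sarimax_model_fit
model_obj = {"order": (3, 1, 3), "seasonal_order": (0, 0, 0, 24)}

folder_path = tempfile.mkdtemp()  # illustrative; the notebook uses "../models/"
model_name = f"sarimax_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
path = os.path.join(folder_path, model_name)  # safer than string concatenation

with open(path, "wb") as f:
    pickle.dump(model_obj, f)

# Later session: load the model back for forecasting without refitting
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == model_obj)  # prints True
```

Using `os.path.join` instead of `folder_path + model_name` also keeps the path construction portable across operating systems.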
5.3.3.2.1 Temperature and Wind Modeling over Time - Select Model (Lazy Regressor)
“Progress isn’t made by early risers. It’s made by lazy men trying to find easier ways to do something.” ― Robert Heinlein
Usually, an "educated guess" is made when it comes to model selection: the developer chooses the models he or she considers best suited to the given use case. After implementing the SARIMAX approach, the LazyPredict library will now be used to determine the best models.
LazyPredict is a valuable tool for model selection, particularly when dealing with a large set of potential models and limited time for exhaustive evaluation. In the context of time series forecasting with exogenous variables, where various models such as SARIMAX, linear regression, and machine learning algorithms can be considered, LazyPredict streamlines the model comparison process by automatically assessing the performance of multiple models.
It provides a quick overview of model performances without the need for extensive manual tuning. In this specific use case, where accurate temperature predictions are crucial, the metric of interest is the Root Mean Squared Error (RMSE), as it directly quantifies the predictive accuracy of the models. By leveraging LazyPredict, one can efficiently identify promising model candidates and subsequently fine-tune the selected models for optimal performance based on the metric deemed most important for the forecasting task.
#from lazypredict.Supervised import LazyRegressor
#from sklearn.model_selection import train_test_split
## Define predictor variables
#X = df_temp_wind[['avg_windspd', 'avg_winddir','avg_windgust']]
## Define response variable
#y = df_temp_wind['avg_temp']
#X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state=123)
#reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
#models,pred = reg.fit(X_train, X_test, y_train, y_test)
#print(models)
## takes too long to run
Results
| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| XGBRegressor | 0.13 | 0.13 | 10.66 | 0.44 |
| HistGradientBoostingRegressor | 0.12 | 0.12 | 10.72 | 0.80 |
| LGBMRegressor | 0.12 | 0.12 | 10.72 | 0.17 |
| RandomForestRegressor | 0.12 | 0.12 | 10.72 | 69.52 |
| ExtraTreesRegressor | 0.10 | 0.10 | 10.83 | 14.12 |
| MLPRegressor | 0.08 | 0.08 | 10.95 | 30.44 |
| GradientBoostingRegressor | 0.08 | 0.08 | 10.95 | 14.94 |
| NuSVR | 0.07 | 0.07 | 11.02 | 127.68 |
| SVR | 0.06 | 0.06 | 11.09 | 129.99 |
| KNeighborsRegressor | 0.05 | 0.05 | 11.11 | 0.11 |
| BaggingRegressor | 0.05 | 0.05 | 11.13 | 7.32 |
| AdaBoostRegressor | 0.03 | 0.03 | 11.25 | 1.34 |
| LassoCV | 0.01 | 0.01 | 11.34 | 0.42 |
| LassoLarsCV | 0.01 | 0.01 | 11.34 | 0.06 |
| Lars | 0.01 | 0.01 | 11.34 | 0.17 |
| LarsCV | 0.01 | 0.01 | 11.34 | 0.15 |
| TransformedTargetRegressor | 0.01 | 0.01 | 11.34 | 0.03 |
| LassoLarsIC | 0.01 | 0.01 | 11.34 | 0.04 |
| LinearRegression | 0.01 | 0.01 | 11.34 | 0.02 |
| OrthogonalMatchingPursuitCV | 0.01 | 0.01 | 11.34 | 0.05 |
| Ridge | 0.01 | 0.01 | 11.34 | 0.02 |
| RidgeCV | 0.01 | 0.01 | 11.34 | 0.03 |
| ElasticNetCV | 0.01 | 0.01 | 11.34 | 0.30 |
| BayesianRidge | 0.01 | 0.01 | 11.34 | 0.03 |
| PoissonRegressor | 0.01 | 0.01 | 11.34 | 0.05 |
| SGDRegressor | 0.01 | 0.01 | 11.34 | 0.07 |
| OrthogonalMatchingPursuit | 0.01 | 0.01 | 11.35 | 0.02 |
| TweedieRegressor | 0.01 | 0.01 | 11.35 | 0.02 |
| GammaRegressor | 0.01 | 0.01 | 11.35 | 0.14 |
| ElasticNet | 0.01 | 0.01 | 11.36 | 0.03 |
| LassoLars | 0.01 | 0.01 | 11.39 | 0.02 |
| Lasso | 0.01 | 0.01 | 11.39 | 0.04 |
| HuberRegressor | 0.00 | 0.01 | 11.39 | 0.14 |
| LinearSVR | 0.00 | 0.00 | 11.42 | 0.10 |
| DummyRegressor | -0.00 | -0.00 | 11.42 | 0.01 |
| PassiveAggressiveRegressor | -0.20 | -0.20 | 12.51 | 0.04 |
| RANSACRegressor | -0.48 | -0.47 | 13.87 | 0.13 |
| DecisionTreeRegressor | -0.64 | -0.64 | 14.64 | 0.86 |
| ExtraTreeRegressor | -0.69 | -0.68 | 14.82 | 0.15 |
--> In the following steps, the model with the lowest RMSE value, the XGBRegressor, will be implemented.
5.3.3.2.2 Temperature and Wind Modeling over Time - Training and Validation (XGB Regressor)
# Extract features and target
features = ['avg_windspd', 'avg_winddir']
target = 'avg_temp'
X = df_temp_wind[features]
y = df_temp_wind[target]
# Split the data into training and testing sets
# (this random split is superseded by the time-aware TimeSeriesSplit further down)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create time-related features
def create_time_feature(data):
    data['dayofmonth'] = data.index.day
    data['dayofweek'] = data.index.dayofweek
    data['quarter'] = data.index.quarter
    data['month'] = data.index.month
    data['year'] = data.index.year
    data['dayofyear'] = data.index.dayofyear
    data['weekofyear'] = data.index.isocalendar().week

create_time_feature(df_temp_wind)
df_temp_wind.head()
| avg_temp | avg_windspd | avg_winddir | avg_windgust | dayofmonth | dayofweek | quarter | month | year | dayofyear | weekofyear | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| run_datetime | |||||||||||
| 2015-07-15 00:00:00 | 287.389224 | 3.386380 | 80.302464 | 11.136197 | 15 | 2 | 3 | 7 | 2015 | 196 | 29 |
| 2015-07-15 01:00:00 | 287.378997 | 3.326687 | 76.866373 | 11.002795 | 15 | 2 | 3 | 7 | 2015 | 196 | 29 |
| 2015-07-15 02:00:00 | 287.388845 | 3.243494 | 76.258867 | 10.700595 | 15 | 2 | 3 | 7 | 2015 | 196 | 29 |
| 2015-07-15 03:00:00 | 287.427324 | 3.145505 | 78.299616 | 10.323983 | 15 | 2 | 3 | 7 | 2015 | 196 | 29 |
| 2015-07-15 04:00:00 | 287.489158 | 3.047607 | 84.632852 | 9.921157 | 15 | 2 | 3 | 7 | 2015 | 196 | 29 |
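One subtlety in `create_time_feature`: `isocalendar().week` returns pandas' nullable UInt32 extension dtype, which some estimators cannot consume directly; casting to a plain integer avoids surprises downstream. A small sketch (index values illustrative, matching the table above):

```python
import pandas as pd

idx = pd.date_range("2015-07-15", periods=3, freq="h")
df = pd.DataFrame(index=idx)

week = df.index.isocalendar().week      # nullable UInt32 extension dtype
df["weekofyear"] = week.astype(int)     # plain int64, safe for sklearn/xgboost

print(df["weekofyear"].dtype)
```

The other attributes used in `create_time_feature` (`day`, `month`, `year`, etc.) already return plain integer arrays and need no conversion.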
5.3.3.2.3 Temperature and Wind Modeling over Time - Fit Model (XGB Regressor)
The next action involves fitting an XGB Regressor to predict average temperature from wind speed and direction features. TimeSeriesSplit is employed for cross-validation, maintaining the temporal sequence of observations. This method evaluates the model's predictive performance on unseen data, ensuring its effectiveness for future forecasting. The chosen hyperparameters for the XGB Regressor aim to optimize the balance between model complexity and accuracy.
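To make the cross-validation scheme concrete: TimeSeriesSplit always places each test window strictly after its training window, with training windows that expand fold by fold. A tiny illustration on 12 dummy observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 ordered observations
tscv = TimeSeriesSplit(n_splits=3)

# Each fold trains on everything before its test window
for fold, (tr, te) in enumerate(tscv.split(X_demo)):
    print(f"fold {fold}: train {tr.min()}-{tr.max()}, test {te.min()}-{te.max()}")
# fold 0: train 0-2, test 3-5
# fold 1: train 0-5, test 6-8
# fold 2: train 0-8, test 9-11
```

Note that in the loop below the model is refitted on each expanding window, so the variables surviving the loop hold only the final fold.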
from sklearn.model_selection import TimeSeriesSplit
import xgboost as xgb
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, max_depth=3, learning_rate=0.1, alpha=0.5)
# Initialize time aware cross-validation
tscv = TimeSeriesSplit(n_splits=5)
# Perform cross-validation: the model is refitted on each expanding window,
# and the variables below retain the final fold after the loop ends
for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train, y_train)
test_predictions_index = X_test.index
predictions = model.predict(X_test)
predictions_df = pd.DataFrame(predictions, index=test_predictions_index, columns=['avg_temp_prediction'])
5.3.3.2.4 Temperature and Wind Modeling over Time - Evaluation on Test Set (XGB Regressor)
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': predictions_df['avg_temp_prediction']})
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(comparison_df['Actual'], comparison_df['Predicted'])
print(f'Mean Squared Error: {mse}')
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(comparison_df['Actual'], comparison_df['Predicted'])
print(f'Mean Absolute Error: {mae}')
Mean Squared Error: 125.31079889302593
Mean Absolute Error: 9.367344482232422
The XGB Regressor's performance on the test set yields a Mean Squared Error (MSE) of 125.31 and a Mean Absolute Error (MAE) of 9.37. These metrics indicate the average magnitude of the model's errors in the temperature predictions, with the values suggesting room for improvement in model accuracy.
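For scale, taking the square root of the reported MSE gives the RMSE in the same Kelvin units as `avg_temp`, which can be set against the constant-mean baseline from the LazyPredict table above:

```python
import numpy as np

mse = 125.31079889302593  # XGB test MSE reported above
rmse = float(np.sqrt(mse))
print(round(rmse, 2))  # prints 11.19

# For context: the LazyPredict table above listed DummyRegressor (predicting
# the mean) at RMSE 11.42, so with wind features alone the XGB model improves
# only marginally on a constant baseline.
```

This comparison underlines the conclusion below that additional features are needed rather than further tuning alone.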
For a clearer interpretation of the model's performance, a visual comparison of the predicted values against the actual data over time will be insightful.
# 'comparison_df' contains the actual and predicted values
plt.figure(figsize=(12, 6))
plt.plot(comparison_df.index, comparison_df['Actual'], label='Actual')
plt.plot(comparison_df.index, comparison_df['Predicted'], label='Predicted')
plt.xlabel('Time')
plt.ylabel('avg_temp')
plt.title('Predictions vs. Actual Values')
plt.legend()
plt.show()
The plot displays the actual and predicted average temperatures over a given time period, offering a direct visual assessment of the XGB Regressor model's forecasting performance.
Explanation:
The blue line, representing the actual temperatures, shows a complex pattern with significant variability over time, which is typical of temperature data that tends to be influenced by many factors. The orange line represents the model's predicted temperatures. It follows the general trend of the actual temperatures but with notable differences in peaks and valleys, suggesting that while the model captures the general direction of temperature changes, it may not capture the full extent of the variability. There are periods, such as the sharp dip early in the time series and the peak towards the end, where the predicted values diverge significantly from the actual values. These points indicate moments where the model fails to accurately predict extreme temperatures. The density of the lines suggests the model predictions are relatively consistent over time, but also that there might be a consistent underestimation or overestimation in certain periods which could be indicative of model bias.
Conclusion:
The visual comparison suggests that the XGB Regressor has a moderate level of predictive accuracy, as it seems to follow the trend of the actual temperature but with discrepancies in magnitude and timing. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) values reported earlier reinforce this observation, indicating that while the model has predictive capabilities, there is room for improvement, particularly in capturing the amplitude of temperature fluctuations. This suggests that further tuning of the model, possibly with additional features or parameter adjustments, could enhance its performance. Moreover, exploring model residuals and conducting error analysis could provide deeper insights into the specific conditions under which the model performs well or poorly.
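The residual analysis suggested above can be sketched by grouping absolute errors by calendar month, which exposes any seasonal bias in the model. A self-contained example on synthetic residuals (index, scale, and names are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic residuals on an hourly index, standing in for y_test - predictions
idx = pd.date_range("2022-01-01", periods=24 * 365, freq="h")
rng = np.random.default_rng(3)
residuals = pd.Series(rng.normal(scale=2.0, size=len(idx)), index=idx)

# Mean absolute residual per calendar month reveals seasons where the model drifts
monthly_mae = residuals.abs().groupby(residuals.index.month).mean()
print(monthly_mae.round(2))
```

Applied to the real residuals, months with a markedly higher mean absolute residual would pinpoint the conditions under which the model performs poorly.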
5.3.3.2.5 Temperature and Wind Modeling over Time - Save Model (XGB Regressor)
# Create a meaningful name with a timestamp
model_name = f"xgb_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name, 'wb') as f:
    pickle.dump(model, f)
print(f"The XGB model has been saved to '{folder_path + model_name}'.")
The XGB model has been saved to '../models/xgb_model_20231229_1808.pkl'.
Is it possible to build an accurate regression model to predict temperature based on historical data? This involves exploring the relationships between temperature and time (time of day, day, season, etc.). The result of this regression analysis is the identification of a trend in the weather data as well as the prediction of the temperature for the next x days, weeks, etc. based on the historical weather.
# Clone the DataFrame to manipulate it for the forecast prediction
forecast_data = df.copy()
# Convert timestamp to datetime format
forecast_data['timestamp'] = pd.to_datetime(forecast_data['run_datetime'])
# Extract day of the year and hour of the day
forecast_data['day_of_year'] = forecast_data['timestamp'].dt.dayofyear
forecast_data['hour'] = forecast_data['timestamp'].dt.hour
# Assuming 'avg_temp' is the target variable
X = forecast_data[['day_of_year', 'hour']]
y = forecast_data['avg_temp']
# Create models
linear_regression_model = make_pipeline(LinearRegression())
gradient_boosting_model = GradientBoostingRegressor()
sgd_regressor_model = make_pipeline(StandardScaler(), SGDRegressor())
svr_model = SVR()
# Split the data into training and testing sets
train_mask = (forecast_data['timestamp'].dt.year >= 2015) & (forecast_data['timestamp'].dt.year <= 2021)
test_mask = (forecast_data['timestamp'].dt.year == 2022)
X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]
# Train models
linear_regression_model_fit = linear_regression_model.fit(X_train, y_train)
gradient_boosting_model_fit = gradient_boosting_model.fit(X_train, y_train)
sgd_regressor_model_fit = sgd_regressor_model.fit(X_train, y_train)
svr_model_fit = svr_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred_linear_regression = linear_regression_model_fit.predict(X_test)
y_pred_gradient_boosting = gradient_boosting_model_fit.predict(X_test)
y_pred_sgd_regressor = sgd_regressor_model_fit.predict(X_test)
y_pred_svr = svr_model_fit.predict(X_test)
# Calculate residuals
residuals_linear_regression = y_test - y_pred_linear_regression
residuals_gradient_boosting = y_test - y_pred_gradient_boosting
residuals_sgd_regressor = y_test - y_pred_sgd_regressor
residuals_svr = y_test - y_pred_svr
# Calculate MSE, MAE, and Mean Residual Error for each model
mse_linear_regression = mean_squared_error(y_test, y_pred_linear_regression)
mae_linear_regression = mean_absolute_error(y_test, y_pred_linear_regression)
mean_residual_error_linear_regression = np.mean(residuals_linear_regression)
mse_gradient_boosting = mean_squared_error(y_test, y_pred_gradient_boosting)
mae_gradient_boosting = mean_absolute_error(y_test, y_pred_gradient_boosting)
mean_residual_error_gradient_boosting = np.mean(residuals_gradient_boosting)
mse_sgd_regressor = mean_squared_error(y_test, y_pred_sgd_regressor)
mae_sgd_regressor = mean_absolute_error(y_test, y_pred_sgd_regressor)
mean_residual_error_sgd_regressor = np.mean(residuals_sgd_regressor)
mse_svr = mean_squared_error(y_test, y_pred_svr)
mae_svr = mean_absolute_error(y_test, y_pred_svr)
mean_residual_error_svr = np.mean(residuals_svr)
# Visualize the predictions, residuals, and tolerance zone
plt.figure(figsize=(15, 20))
# Linear Regression subplot
plt.subplot(4, 2, 1)
plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
plt.plot(forecast_data[test_mask]['timestamp'], y_pred_linear_regression, label=f'Linear Regression Predicted\nMSE: {mse_linear_regression:.2f}, MAE: {mae_linear_regression:.2f}, Mean Residual: {mean_residual_error_linear_regression:.2f}')
plt.xlabel('Timestamp')
plt.ylabel('Temperature')
plt.title('Linear Regression')
plt.legend()
# Linear Regression Residuals subplot
plt.subplot(4, 2, 2)
plt.scatter(y_pred_linear_regression, residuals_linear_regression)
plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
plt.xlabel('Predicted Temperature')
plt.ylabel('Residuals')
plt.title('Linear Regression Residuals')
plt.legend()
# Gradient Boosting Regressor subplot
plt.subplot(4, 2, 3)
plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
plt.plot(forecast_data[test_mask]['timestamp'], y_pred_gradient_boosting, label=f'Gradient Boosting Predicted\nMSE: {mse_gradient_boosting:.2f}, MAE: {mae_gradient_boosting:.2f}, Mean Residual: {mean_residual_error_gradient_boosting:.2f}')
plt.xlabel('Timestamp')
plt.ylabel('Temperature')
plt.title('Gradient Boosting Regressor')
plt.legend()
# Gradient Boosting Residuals subplot
plt.subplot(4, 2, 4)
plt.scatter(y_pred_gradient_boosting, residuals_gradient_boosting)
plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
plt.xlabel('Predicted Temperature')
plt.ylabel('Residuals')
plt.title('Gradient Boosting Residuals')
plt.legend()
# SGD Regressor subplot
plt.subplot(4, 2, 5)
plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
plt.plot(forecast_data[test_mask]['timestamp'], y_pred_sgd_regressor, label=f'SGD Regressor Predicted\nMSE: {mse_sgd_regressor:.2f}, MAE: {mae_sgd_regressor:.2f}, Mean Residual: {mean_residual_error_sgd_regressor:.2f}')
plt.xlabel('Timestamp')
plt.ylabel('Temperature')
plt.title('SGD Regressor')
plt.legend()
# SGD Regressor Residuals subplot
plt.subplot(4, 2, 6)
plt.scatter(y_pred_sgd_regressor, residuals_sgd_regressor)
plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
plt.xlabel('Predicted Temperature')
plt.ylabel('Residuals')
plt.title('SGD Regressor Residuals')
plt.legend()
# SVR subplot
plt.subplot(4, 2, 7)
plt.plot(forecast_data[test_mask]['timestamp'], y_test, label='Actual')
plt.plot(forecast_data[test_mask]['timestamp'], y_pred_svr, label=f'SVR Predicted\nMSE: {mse_svr:.2f}, MAE: {mae_svr:.2f}, Mean Residual: {mean_residual_error_svr:.2f}')
plt.xlabel('Timestamp')
plt.ylabel('Temperature')
plt.title('Support Vector Regressor')
plt.legend()
# SVR Residuals subplot
plt.subplot(4, 2, 8)
plt.scatter(y_pred_svr, residuals_svr)
plt.axhline(y=0, color='black', linestyle='--', label='Zero Residuals Line')
plt.xlabel('Predicted Temperature')
plt.ylabel('Residuals')
plt.title('Support Vector Regressor Residuals')
plt.legend()
plt.tight_layout()
# Save before showing: plt.show() clears the current figure, so calling
# savefig afterwards would write an empty image
plt.savefig('../output/Linear Regression Analysis Prediction Forecast Results.png')
plt.show()
The series of plots provided depict the performance of various regression models in predicting temperature over time and their respective residuals. Here's a detailed explanation of the visualized results:
Linear Regression:
The top-left plot shows the actual temperatures against the predicted temperatures from the Linear Regression model over time. The model appears to capture the overall trend but with notable deviations, particularly in the peaks and troughs. The top-right plot of Linear Regression residuals displays the difference between the predicted and actual temperatures. The residuals should ideally be randomly scattered around the zero line; however, a pattern appears to show variance increasing with temperature, indicating heteroscedasticity.
Gradient Boosting Regressor:
The second row left plot illustrates the Gradient Boosting model's predictions alongside the actual data. This model seems to track the temperature changes more closely than the Linear Regression model, with fewer and smaller deviations. The residuals for the Gradient Boosting model, shown in the second row right plot, are more densely packed around the zero line compared to the Linear Regression, suggesting better performance in capturing the variance of the temperature data.
SGD Regressor:
In the third row left plot, the SGD Regressor's predictions are compared with the actual temperatures. This model's predictions seem to follow a linear trend that does not account for the seasonal fluctuations as effectively as the Gradient Boosting model. The residuals for the SGD Regressor, depicted in the third row right plot, exhibit a similar pattern to the Linear Regression, with a spread that increases with the predicted temperature values.
Support Vector Regressor (SVR):
The bottom-left plot for the SVR presents its predictions, which appear to smooth out the actual temperature fluctuations, potentially missing some of the more nuanced changes. The SVR residuals, shown in the bottom-right plot, display a spread that suggests the SVR may not be capturing the full range of temperature variability, as evidenced by the wider spread of residuals compared to the Gradient Boosting model.
Conclusion: The visual analysis corroborates the conclusion that a satisfactory approximation of temperature trends over a year in Bancroft, Canada, is feasible with the use of linear regression models. While exact daily weather prediction remains challenging, the models demonstrate the ability to predict the general progression of temperatures with varying degrees of accuracy. Among the models, the Gradient Boosting Regressor shows a promising balance of capturing both the general trend and the variance in temperature, as reflected in its tighter residuals distribution. These results are consistent with the understanding that while perfect predictions are not attainable, linear regression models can provide valuable forecasts for temperature trends.
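One refinement worth noting: the models above receive `day_of_year` and `hour` as raw integers, so day 365 and day 1 appear maximally far apart even though they are adjacent in time. A common remedy is a sine/cosine encoding that maps each cyclic feature onto the unit circle; a hedged sketch (column names illustrative):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"day_of_year": [1, 91, 182, 273, 365],
                        "hour": [0, 6, 12, 18, 23]})

# Map each cyclic feature onto the unit circle
df_demo["doy_sin"] = np.sin(2 * np.pi * df_demo["day_of_year"] / 365.25)
df_demo["doy_cos"] = np.cos(2 * np.pi * df_demo["day_of_year"] / 365.25)
df_demo["hour_sin"] = np.sin(2 * np.pi * df_demo["hour"] / 24)
df_demo["hour_cos"] = np.cos(2 * np.pi * df_demo["hour"] / 24)

# Day 365 and day 1 now land close together on the circle
print(df_demo[["doy_sin", "doy_cos"]].round(3))
```

Feeding the encoded columns instead of the raw integers would particularly help the linear models (Linear Regression, SGD, SVR), which otherwise cannot represent the seasonal cycle.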
# Create a meaningful name with a timestamp
model_name_lr_3 = f"lr_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_gb_3 = f"gb_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_sgd_3 = f"sgd_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_svr_3 = f"svr_model_fit_3_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name_lr_3, 'wb') as f:
    pickle.dump(linear_regression_model_fit, f)
with open(folder_path + model_name_gb_3, 'wb') as f:
    pickle.dump(gradient_boosting_model_fit, f)
with open(folder_path + model_name_sgd_3, 'wb') as f:
    pickle.dump(sgd_regressor_model_fit, f)
with open(folder_path + model_name_svr_3, 'wb') as f:
    pickle.dump(svr_model_fit, f)
While logistic regression uses mathematical regression techniques, its result is in fact a binary classification. For this reason, the logistic regression analysis is conducted between the linear regression and classification chapters, "building a bridge" between the two topics and supporting a better understanding of the data story being told.
The analysis for predicting extreme weather events from temperature is conducted using logistic regression. It answers the research question: can logistic regression effectively classify and predict the occurrence of extreme or normal weather events based on temperature (or, alternatively, windspeed) ranges? This involves categorizing temperature into distinct classes and determining whether certain temperature ranges are indicative of extreme weather events. The outcome of this analysis contributes to deciding whether an observation is an extreme weather event, providing insight into the correlation between temperature and specific meteorological phenomena. For example, it could be assumed that extremely cold temperatures are more common during extreme weather events; whether this holds is analyzed in this chapter. The task also includes identifying the critical temperature thresholds associated with the onset or intensification of different weather events, i.e. temperature breakpoints that signify transitions between weather conditions. Understanding these thresholds is essential for developing early warning systems and enhancing preparedness for weather-related events.
label1_counts = df['label1'].value_counts()
print(label1_counts)
label1
1    52633
0    12712
Name: count, dtype: int64
# Create Logistic Regression model
logreg_model = LogisticRegression()
# Select the features (independent variable)
X = df[['avg_temp']]
# Select the dependent variable
y = df['label1']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features (reshape to 2D array)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.values.reshape(-1, 1))
X_test_scaled = scaler.transform(X_test.values.reshape(-1, 1))
# Print the shapes of the sets
print("X_train shape:", X_train_scaled.shape)
print("X_test shape:", X_test_scaled.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
X_train shape: (52276, 1)
X_test shape: (13069, 1)
y_train shape: (52276,)
y_test shape: (13069,)
# Fit Logistic Regression model
logreg_model.fit(X_train_scaled, y_train)
# Add a constant to the features for statsmodels
X_train_sm = sm.add_constant(X_train_scaled)
# Create a logistic regression model using statsmodels
logit_model_sm = sm.Logit(y_train, X_train_sm)
# Fit the logit sm model
logit_model = logit_model_sm.fit()
Optimization terminated successfully.
Current function value: 0.375341
Iterations 7
from yellowbrick.classifier import ConfusionMatrix, ClassificationReport
# Print AIC
aic_value = logit_model.aic
print("AIC:", aic_value)
print("Label 0: Extreme Weather Event \n Label 1: Blue Sky Day")
# Make predictions on the test set using the result object
X_test_sm = sm.add_constant(X_test_scaled)
y_prob = logit_model.predict(X_test_sm) # Probabilities
y_pred = (y_prob > 0.5).astype(int) # Convert probabilities to binary predictions
# Confusion Matrix as Diagram with custom colors
cm_viz = ConfusionMatrix(logreg_model, classes=[0, 1], cmap='RdYlGn')
cm_viz.score(X_test_scaled, y_test)
cm_viz.finalize()
# Save before showing, otherwise an empty figure is written to disk
plt.savefig('../output/Logistic Regression Predicting WEP by Temperature Confusion Matrix.png')
cm_viz.show()
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
AIC: 39246.639678249965
Label 0: Extreme Weather Event
Label 1: Blue Sky Day
Classification Report:
precision recall f1-score support
0 0.39 0.20 0.26 2515
1 0.83 0.93 0.88 10554
accuracy 0.79 13069
macro avg 0.61 0.56 0.57 13069
weighted avg 0.75 0.79 0.76 13069
The confusion matrix for the Logistic Regression model offers a visual summary of the model's performance in classifying extreme weather events versus blue sky days.
Explanation of the Confusion Matrix:
The matrix has two classes: Label 0 (Extreme Weather Event) and Label 1 (Blue Sky Day).

- Top left (true negatives): 502 extreme weather events correctly predicted by the model.
- Bottom right (true positives): 9,776 blue sky days correctly identified.
- Top right (false positives): 2,013 extreme weather events incorrectly classified as blue sky days.
- Bottom left (false negatives): 778 blue sky days wrongly predicted as extreme weather events.

Conclusion:
The model's performance, as reflected in the confusion matrix, aligns with the previously stated conclusion. The high number of false positives (2013) compared to true negatives (502) suggests that the model is inclined to predict blue sky days over extreme weather events. This results in a low recall for Label 0, meaning that the model missed a significant number of actual extreme weather events, classifying them as non-events. This could have serious implications in a real-world setting where predicting extreme weather accurately is crucial.
Conversely, the model has a high number of true positives (9776) for blue sky days, with fewer false negatives (778), leading to a high recall for Label 1. This indicates that the model is more reliable in predicting non-extreme weather conditions.
The high false positive rate and the low recall for extreme weather events underscore the need for model improvement. Possible steps include incorporating more predictive features beyond temperature, applying different thresholding strategies for classification, or using a model that better captures the complexity of extreme weather events. The subsequent visualization in a logistic regression graph could provide further insights into the model's performance and help in understanding the probability thresholds used for classification.
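Two of the improvement levers mentioned above, class weighting and threshold adjustment, can be sketched as follows. This is a hedged illustration on synthetic one-dimensional data standing in for the ERA5 temperature feature; the threshold value 0.3 is an arbitrary example, not a tuned choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data: class 1 is the majority, class 0 the rare class
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (900, 1)),   # majority class 1
               rng.normal(1.5, 1.0, (100, 1))])  # overlapping minority class 0
y = np.array([1] * 900 + [0] * 100)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Lever 2: lower the decision threshold for the rare class
# (classify as 0 whenever P(class 0) > 0.3 instead of > 0.5)
prob_class0 = plain.predict_proba(X)[:, 0]
y_pred_thresh = np.where(prob_class0 > 0.3, 0, 1)

r_plain = recall_score(y, plain.predict(X), pos_label=0)
r_weighted = recall_score(y, weighted.predict(X), pos_label=0)
r_thresh = recall_score(y, y_pred_thresh, pos_label=0)

print("recall for class 0, default:  ", round(r_plain, 3))
print("recall for class 0, weighted: ", round(r_weighted, 3))
print("recall for class 0, threshold:", round(r_thresh, 3))
```

Both levers trade precision for recall on the rare class; whether that trade is worthwhile depends on the cost of missing an extreme weather event.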
# Visualize results
plt.figure(figsize=(12, 6))
# Plot the data points on the logistic regression curve based on probabilities
y_vals_prob = logit_model.predict(sm.add_constant(scaler.transform(X_test.values.reshape(-1, 1))))
plt.scatter(X_test, y_test, color='blue', label='Actual Labels, yi')
plt.scatter(X_test, y_vals_prob, color='red', marker='o', label='Predicted Probabilities, p(Xi)')
# Plot the linear part of the logistic regression curve
x_vals = np.linspace(X_test.min(), X_test.max(), 100)
linear_combination = logit_model.params[0] + logit_model.params[1] * scaler.transform(x_vals.reshape(-1, 1))
# Logistic function (sigmoid) to convert linear combination to probability
logistic_function = 1 / (1 + np.exp(-linear_combination))
plt.plot(x_vals, linear_combination, color='green', linewidth=2, label='Logit, f(x)')
plt.plot(x_vals, logistic_function, color='orange', linewidth=2, label='Logistic Regression Curve, p(x)')
# Plot the threshold line at y=0.5
plt.axhline(y=0.5, color='purple', linestyle='--', label='Threshold = 0.5')
# Set y-axis limits
plt.ylim(-0.01, 1.01)
# Display AIC value
plt.text(plt.xlim()[1], plt.ylim()[1], f'AIC = {aic_value:.2f}', ha='right', va='top', fontsize=10, color='purple')
plt.xlabel('Temperature')
plt.ylabel('Probability of Blue Sky Day (Label 1)')
plt.suptitle('Logistic Regression Analysis')
plt.title('Label 0: Extreme Weather Event \n Label 1: Blue Sky Day')
plt.legend()
# Save before showing, otherwise an empty figure is written to disk
plt.savefig('../output/Logistic Regression Analysis Predicting WEP by Temperature Results.png')
plt.show()
The logistic regression analysis graph shows the relationship between temperature and the probability of an extreme weather event.
Explanation of the Plot:
Blue dots represent the actual labels from the test set, where a dot at 1 on the y-axis indicates a blue sky day (Label 1) and a dot at 0 indicates an extreme weather event (Label 0). Red dots are the predicted probabilities p(Xi) that the logistic regression model assigns to each temperature value, reflecting the likelihood of a blue sky day. The green line represents the linear combination of predictors (the logit function), which is transformed by the logistic function to yield probabilities. The orange line is the logistic regression curve p(x), giving the probability of Label 1 across different temperatures. The purple dashed line marks the threshold of 0.5 used to classify a prediction: probabilities above this line are classified as blue sky days, while those below are classified as extreme weather events.

Conclusion:
Aligning with the conclusion, the graph demonstrates that temperature alone, as a predictor, does not suffice for accurately classifying extreme weather events. There is a significant overlap in the temperature ranges for both event types, resulting in a high rate of false positives and a low recall for predicting extreme weather events. This indicates that logistic regression with temperature as the sole predictor is not robust enough for this classification task. The performance could potentially be improved by including additional predictors, tuning hyperparameters, or using regularization techniques. However, considering the overlapping distribution of temperature values for both event types, logistic regression may not be the most effective method for this problem, and alternative modeling approaches should be considered.
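The relationship between the green logit line and the orange probability curve can be made concrete with a small worked example. The coefficients b0 and b1 below are invented for illustration, not the fitted ERA5 values:

```python
import numpy as np

# The logit f(x) = b0 + b1*x is a straight line in x; the logistic (sigmoid)
# function squashes it into a probability p(x) in (0, 1).
b0, b1 = -1.0, 0.8  # illustrative coefficients only

def p_of_x(x):
    """Probability via the sigmoid of the linear logit."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

for x in [-3.0, 0.0, 1.25, 3.0]:
    print(f"x={x:+.2f}  logit={b0 + b1 * x:+.2f}  p={p_of_x(x):.3f}")

# Where the logit crosses 0 (here x = -b0/b1 = 1.25), p(x) = 0.5 -- exactly
# where the 0.5 threshold line intersects the logistic curve in the plot.
```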
# Create a meaningful name with a timestamp
model_name_logreg = f"logreg_model_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name_logreg, 'wb') as f:
    pickle.dump(logreg_model, f)
Research Questions:
Extreme Weather Events: Extreme weather events such as storms or heavy snowfall should be predicted. This involves training a binary classification model to identify patterns indicative of extreme events. The result of this classification analysis is the prediction of extreme weather events based on the current weather data and a model trained on historical weather data.
Weather Event and Pattern Classification: Prediction of different weather patterns based on multivariate weather data. This involves using multiclass classification algorithms. The result of this classification analysis is the prediction of specific weather events based on the current weather data and a model trained on historical weather data.
Predicting extreme weather events is paramount for safeguarding public safety, protecting infrastructure, and ensuring food security. Early warnings enable timely evacuations and preparations, reducing the impact on lives and minimizing damage to critical structures. Additionally, forecasting various weather conditions beyond extremes is vital for everyday planning, energy management, transportation, and sustainable resource use. Accurate weather predictions contribute to resilient communities, supporting both immediate disaster response and long-term planning for a wide range of societal needs.
Extreme Weather Events in Binary Classification: Is it possible to classify and predict extreme weather events such as storms? This involves training a binary classification model to identify patterns indicative of extreme events. The result of this classification analysis is the prediction of extreme weather events based on the current weather data and a model that was trained on historical weather data.
In the first step, a binary classification of extreme weather events versus blue sky weather is conducted.
Blue sky events can be identified by the variable "label1", which takes the following values:
# show data as pie chart
labels = ['Blue Sky', 'Extreme Weather']
sizes = df['label1'].value_counts()
colors = [ 'lightskyblue','lightcoral']
explode = (0.1, 0) # explode first
# Plotting the pie chart
plt.figure(figsize=(6, 6))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct=lambda p: '{:.0f} ({:.1f}%)'.format(p * sum(sizes) / 100, p),
shadow=True, startangle=140)
plt.axis('equal')
plt.title('Blue Sky Events vs. Extreme Weather Events')
# Save before showing, otherwise an empty figure is written to disk
plt.savefig('../output/Binary Classification Blue Sky vs Extreme Weather Events.png')
plt.show()
The pie chart illustrates the distribution of weather events classified as either 'Blue Sky' or 'Extreme Weather' within a dataset.
Explanation of the Plot:
The large blue section, representing 'Blue Sky' events, accounts for the majority of the data with 52,633 cases, which is 80.5% of the total events in the dataset. The smaller red section, depicting 'Extreme Weather' events, comprises 12,712 cases, amounting to 19.5% of the events. The explode parameter slightly offsets the first ('Blue Sky') slice, visually separating the two categories. The autopct parameter annotates each category with both its percentage and its absolute number of events, offering an immediate understanding of the proportion between the two classifications.
Conclusion:
The visualization effectively communicates the imbalance in the dataset, with a significantly larger number of 'Blue Sky' days compared to 'Extreme Weather' events. This imbalance is a common challenge in binary classification tasks, as it may lead to a model that is biased towards the majority class. In practice, this could result in a higher rate of false negatives for 'Extreme Weather' predictions, which, as previously discussed, can have serious implications for public safety and disaster preparedness. The chart supports the necessity for careful consideration of class distribution when training a binary classification model to ensure it does not underrepresent the less frequent but critical 'Extreme Weather' events.
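One standard way to account for this imbalance without resampling is to weight the classes inversely to their frequency and pass the result to a classifier's class_weight parameter. A sketch using the counts from the pie chart above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild the label vector from the observed counts
# (52,633 blue sky days = label 1, 12,712 extreme events = label 0)
y = np.array([1] * 52633 + [0] * 12712)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# "Balanced" weights follow n_samples / (n_classes * count_per_class), so the
# rare 'Extreme Weather' class (label 0) receives the larger weight.
```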
First, the used data will be preprocessed:
# Copy df so the following preprocessing does not modify the original dataframe
df_class = df.copy()
# Calculate the mean value of the column
mean_value = df_class["avg_pressure_change"].mean()
#set "avg_temp_change" to 0 in the first row, as there is no temp change
df_class.at[0, 'avg_temp_change'] = 0
#drop columns, which are not numeric and not needed
df_class = df_class.drop(columns = ['wep','wind_direction_label', 'run_datetime','label2','avg_pressure_change','label0', 'display_label'])
#show adjusted df
df_class.head()
| avg_temp | avg_temp_celsius | min_wet_bulb_temp | avg_dewpoint | avg_temp_change | avg_windspd | max_windgust | avg_winddir | avg_winddir_sin | avg_winddir_cos | max_cumulative_precip | max_snow_density_6 | max_cumulative_snow | max_cumulative_ice | label1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 287.389224 | 14.239224 | 280.809506 | 280.735246 | 0.000000 | 3.386380 | 14.899891 | 80.302464 | -0.981653 | 0.190676 | 2.009 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 287.378997 | 14.228997 | 280.809506 | 280.414058 | -0.010227 | 3.326687 | 14.899891 | 76.866373 | 0.994736 | 0.102466 | 1.209 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 287.388845 | 14.238845 | 280.809506 | 280.187074 | 0.009848 | 3.243494 | 14.899891 | 76.258867 | 0.758262 | 0.651950 | 0.400 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | 287.427324 | 14.277324 | 280.809506 | 280.049330 | 0.038479 | 3.145505 | 14.899891 | 78.299616 | 0.237897 | -0.971290 | 0.000 | 0.0 | 0.0 | 0.0 | 1 |
| 4 | 287.489158 | 14.339158 | 280.809506 | 279.980697 | 0.061834 | 3.047607 | 14.702229 | 84.632852 | 0.189006 | -0.981976 | 0.000 | 0.0 | 0.0 | 0.0 | 1 |
df_class["label1"]
0 1
1 1
2 1
3 1
4 1
..
65340 0
65341 1
65342 1
65343 1
65344 1
Name: label1, Length: 65345, dtype: int64
# checking for missing values in the data
df_class.isnull().sum()
avg_temp                 0
avg_temp_celsius         0
min_wet_bulb_temp        0
avg_dewpoint             0
avg_temp_change          0
avg_windspd              0
max_windgust             0
avg_winddir              0
avg_winddir_sin          0
avg_winddir_cos          0
max_cumulative_precip    0
max_snow_density_6       0
max_cumulative_snow      0
max_cumulative_ice       0
label1                   0
dtype: int64
5.4.1.1.1 Binary Classification - Select Model
Choose Classifier Models with Lazy Classifier:
#from lazypredict.Supervised import LazyClassifier
## Define predictor variables
X = df_class.drop(columns = ['label1', 'max_snow_density_6'])
## Define response variable
y = df_class['label1']
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
#clf = LazyClassifier(verbose=0, ignore_warnings=False, custom_metric=None)
#models, pred = clf.fit(X_train, X_test, y_train, y_test)
#print(models)
## Commented out because the full run takes too long; the results are documented below
Results
| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken |
|---|---|---|---|---|---|
| ExtraTreesClassifier | 0.99 | 0.99 | 0.99 | 0.99 | 4.13 |
| XGBClassifier | 0.99 | 0.99 | 0.99 | 0.99 | 0.63 |
| RandomForestClassifier | 0.99 | 0.99 | 0.99 | 0.99 | 29.54 |
| LGBMClassifier | 0.99 | 0.99 | 0.99 | 0.99 | 0.67 |
| BaggingClassifier | 0.99 | 0.98 | 0.98 | 0.99 | 11.56 |
| DecisionTreeClassifier | 0.99 | 0.98 | 0.98 | 0.99 | 1.54 |
| AdaBoostClassifier | 0.99 | 0.98 | 0.98 | 0.99 | 14.65 |
| LinearSVC | 0.99 | 0.97 | 0.97 | 0.99 | 1.62 |
| CalibratedClassifierCV | 0.99 | 0.97 | 0.97 | 0.99 | 1.48 |
| LogisticRegression | 0.98 | 0.97 | 0.97 | 0.98 | 0.43 |
| QuadraticDiscriminantAnalysis | 0.98 | 0.97 | 0.97 | 0.98 | 0.24 |
| GaussianNB | 0.97 | 0.97 | 0.97 | 0.97 | 0.11 |
| SVC | 0.99 | 0.97 | 0.97 | 0.99 | 22.36 |
| SGDClassifier | 0.99 | 0.97 | 0.97 | 0.99 | 0.16 |
| ExtraTreeClassifier | 0.97 | 0.96 | 0.96 | 0.97 | 0.12 |
| Perceptron | 0.98 | 0.95 | 0.95 | 0.98 | 0.20 |
| PassiveAggressiveClassifier | 0.98 | 0.95 | 0.95 | 0.98 | 0.23 |
| KNeighborsClassifier | 0.97 | 0.92 | 0.92 | 0.96 | 7.63 |
| BernoulliNB | 0.89 | 0.92 | 0.92 | 0.90 | 0.14 |
| NearestCentroid | 0.83 | 0.86 | 0.86 | 0.85 | 0.13 |
| LinearDiscriminantAnalysis | 0.91 | 0.76 | 0.76 | 0.90 | 0.37 |
| RidgeClassifier | 0.89 | 0.72 | 0.72 | 0.87 | 0.11 |
| RidgeClassifierCV | 0.89 | 0.72 | 0.72 | 0.87 | 0.13 |
| DummyClassifier | 0.81 | 0.50 | 0.50 | 0.72 | 0.06 |
The LazyClassifier indicates that "ExtraTreesClassifier," "XGBClassifier," "RandomForestClassifier," and "LGBMClassifier" are the top-performing classifiers for the given use case. Consequently, these classifiers will be implemented in the subsequent steps of the project. Their selection is based on their demonstrated effectiveness in handling the specific characteristics and requirements of the task at hand, as identified by the LazyClassifier's assessment. This strategic choice aims to leverage the strengths of these classifiers to achieve optimal performance and reliable results in the context of the project.
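Since the full LazyClassifier sweep was too slow to rerun, one hedged alternative is to screen a handful of candidate models with cross-validation on a small dataset first and commit only the front-runners to full training. The sketch below uses synthetic data as a stand-in for the ERA5 feature frame:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small imbalanced stand-in dataset (not the ERA5 data)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.2, 0.8], random_state=42)

candidates = {
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Stratified CV keeps the class ratio stable in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = {name: cross_val_score(clf, X, y, cv=cv).mean()
          for name, clf in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {score:.3f}")
```

On a subsample this runs in seconds and gives a rough ranking, at the cost of possibly misordering models whose performance depends on the full data volume.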
5.4.1.1.2 Binary Classification - Training and Validation
# Split the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the classifiers
classifiers = {
'ExtraTrees': ExtraTreesClassifier(),
'XGBoost': XGBClassifier(),
'LGBM': LGBMClassifier(verbose=-1),
'RandomForest': RandomForestClassifier()
}
5.4.1.1.3 Binary Classification - Fit Model and Evaluate on Test Set
# Train the classifiers and collect accuracies
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies[name] = accuracy
    print(f'{name} Accuracy: {accuracy}')
ExtraTrees Accuracy: 0.9931134746346316
XGBoost Accuracy: 0.9941847119136888
LGBM Accuracy: 0.9935725763256561
RandomForest Accuracy: 0.9935725763256561
5.4.1.1.4 Binary Classification - Visualize Results
# Set up subplots
fig, axes = plt.subplots(1, len(classifiers), figsize=(18, 5))
# Iterate over classifiers and plot confusion matrices
for ax, (name, model) in zip(axes, classifiers.items()):
    # Create a pipeline with StandardScaler and the current model
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    # Calculate confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    # Plot heatmap
    sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square=True, cmap='Blues', ax=ax)
    # Set labels and title
    ax.set_xlabel('Predicted Label')
    ax.set_ylabel('True Label')
    ax.set_title(f'{name} Confusion Matrix')
# Adjust layout for better spacing
plt.tight_layout()
# Show the plot
# Save before showing, otherwise an empty figure is written to disk
plt.savefig('../output/Binary Classification Confusion Matrix.png')
plt.show()
The visualization consists of four confusion matrices, each representing the performance of a different classification model used for binary classification of weather events: ExtraTrees, XGBoost, LightGBM (LGBM), and RandomForest.
Explanation of the Confusion Matrices:
First panel (ExtraTrees Confusion Matrix):
True Negatives (TN): 2,448 instances where the model correctly predicted Label 0 (extreme weather event). False Positives (FP): 67 extreme weather events incorrectly classified as blue sky days. False Negatives (FN): 20 blue sky days incorrectly predicted as extreme weather events. True Positives (TP): 10,534 instances where the model correctly predicted Label 1 (blue sky day).
Second panel (XGBoost Confusion Matrix): TN: 2,453, FP: 62, FN: 17, TP: 10,537
Third panel (LGBM Confusion Matrix): TN: 2,446, FP: 69, FN: 13, TP: 10,541
Fourth panel (RandomForest Confusion Matrix): TN: 2,438, FP: 77, FN: 14, TP: 10,540
Conclusion:
All models demonstrate high accuracy, with the vast majority of instances correctly classified, indicating that they discriminate effectively between the two classes. With blue sky days (Label 1) as the positive class, a false positive corresponds to a missed extreme weather event, while a false negative corresponds to a false alarm. The XGBoost classifier misses the fewest extreme weather events (62 false positives), while the LGBM classifier raises the fewest false alarms (13 false negatives). In practical applications, missed extreme weather events are typically the more critical error, as they represent failures to issue timely warnings and safety measures; the XGBoost classifier might therefore be preferred in scenarios where the cost of missing an actual extreme weather event is high. Each of these models offers a trade-off between sensitivity in detecting true events and specificity in avoiding false alarms, which needs to be balanced according to the application's requirements and the consequences of prediction errors.
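This trade-off can be quantified directly from the reported counts. For example, using the XGBoost figures above (TN=2453, FP=62, FN=17, TP=10537, with Label 0 = extreme weather event):

```python
# Counts taken from the XGBoost confusion matrix reported above.
# With Label 1 (blue sky) as the positive class, an FP is an extreme weather
# event classified as a blue sky day, i.e. a missed event; an FN is a blue
# sky day classified as extreme, i.e. a false alarm.
tn, fp, fn, tp = 2453, 62, 17, 10537

miss_rate = fp / (tn + fp)          # share of extreme events missed
false_alarm_rate = fn / (fn + tp)   # share of blue sky days flagged as extreme

print(f"missed extreme events: {miss_rate:.4f}")
print(f"false alarms:          {false_alarm_rate:.4f}")
```

Roughly 2.5% of extreme weather events are missed, against a false alarm rate well below 1%; whether that is acceptable depends on the cost attached to each error type.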
# Iterate over classifiers and print classification reports
for name, model in classifiers.items():
    # Create a pipeline with StandardScaler and the current model
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    # Calculate classification report with more decimal places
    report = classification_report(y_test, y_pred, digits=4)
    print(name, ":\n", report)
ExtraTrees :
precision recall f1-score support
0 0.9911 0.9742 0.9826 2515
1 0.9939 0.9979 0.9959 10554
accuracy 0.9933 13069
macro avg 0.9925 0.9860 0.9892 13069
weighted avg 0.9933 0.9933 0.9933 13069
XGBoost :
precision recall f1-score support
0 0.9931 0.9753 0.9842 2515
1 0.9942 0.9984 0.9963 10554
accuracy 0.9940 13069
macro avg 0.9936 0.9869 0.9902 13069
weighted avg 0.9940 0.9940 0.9939 13069
LGBM :
precision recall f1-score support
0 0.9947 0.9726 0.9835 2515
1 0.9935 0.9988 0.9961 10554
accuracy 0.9937 13069
macro avg 0.9941 0.9857 0.9898 13069
weighted avg 0.9937 0.9937 0.9937 13069
RandomForest :
precision recall f1-score support
0 0.9947 0.9714 0.9829 2515
1 0.9932 0.9988 0.9960 10554
accuracy 0.9935 13069
macro avg 0.9940 0.9851 0.9894 13069
weighted avg 0.9935 0.9935 0.9935 13069
The classification reports provide an evaluation of the performance of different models.
ExtraTrees: The ExtraTrees model demonstrates high precision and recall for both classes, achieving an accuracy of 99.33%. The precision, recall, and F1-score for both extreme weather events (0) and blue sky events (1) are consistently high, indicating robust performance across both classes.
XGBoost: The XGBoost model exhibits excellent precision, recall, and F1-score for both classes, resulting in an overall accuracy of 99.40%. Similar to ExtraTrees, it shows strong performance in correctly classifying both extreme weather and blue sky events.
LightGBM: The LightGBM model achieves a high accuracy of 99.37%, with impressive precision, recall, and F1-score for both classes. Notably, it maintains a high recall for extreme weather events (0), ensuring that a significant proportion of these events are correctly identified.
RandomForest: The RandomForest model performs well, achieving an accuracy of 99.35%. It shows strong precision, recall, and F1-score for both extreme weather events (0) and blue sky events (1), indicating reliable performance across different weather scenarios.
In summary, all four models—ExtraTrees, XGBoost, LightGBM, and RandomForest—demonstrate robust performance in classifying weather events, with high accuracy and consistent precision and recall metrics across the evaluated classes.
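As a possible follow-up (not part of the original analysis), tree ensembles such as ExtraTrees expose feature_importances_, which could indicate which weather variables drive these near-perfect classifications. A sketch on synthetic stand-in data with hypothetical feature names borrowed from the ERA5 frame:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data; the feature names merely mirror ERA5 columns
feature_names = ["avg_temp", "avg_windspd", "max_windgust", "max_cumulative_precip"]
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=3, n_redundant=1, random_state=42)

clf = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; higher values mean the feature splits are used more
for name, imp in sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:25s} {imp:.3f}")
```

On the real data, a single dominant importance (e.g. precipitation) would also flag potential label leakage, since label1 was likely derived from such variables.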
5.4.1.1.5 Binary Classification - Save Model
for name, model in classifiers.items():
    model_name = f"{name}_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
    # Specify the folder path
    folder_path = "../models/"
    # Save the model to the specified folder with the created name
    with open(folder_path + model_name, 'wb') as f:
        pickle.dump(model, f)
After successfully distinguishing extreme weather events from blue sky days, a key result of this research should be the prediction of specific extreme weather events. Once an observation is determined to be an extreme weather event, it is important to analyze what specific kind of extreme weather event it is. These results can then be used by scientists and governmental institutions to take countermeasures to prevent damage and minimize the risk of a weather event becoming hazardous. The analysis for the classification of specific weather events and patterns is conducted using multiclass classification techniques. The research question to be answered is: is it possible to categorize and predict different extreme weather events based on multivariate weather data? This involves using multiclass classification algorithms. The result of this classification analysis is the prediction of specific weather events based on the current weather data and a model trained on historical weather data.
# Creating new dataframe that only includes extreme weather observations
df_extreme_weather = df[df['label1'] != 1].copy()
# Dropping NaN
df_extreme_weather = df_extreme_weather.dropna()
df_extreme_weather
| run_datetime | wep | avg_temp | avg_temp_celsius | min_wet_bulb_temp | avg_dewpoint | avg_temp_change | avg_windspd | max_windgust | avg_winddir | ... | wind_direction_label | max_cumulative_precip | max_snow_density_6 | max_cumulative_snow | max_cumulative_ice | avg_pressure_change | label0 | label1 | label2 | display_label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2238 | 2015-10-16 13:00:00 | Moderate rain | 276.935854 | 3.785854 | 271.085195 | 273.479160 | -0.196776 | 2.610431 | 12.095491 | 302.159235 | ... | Northwest | 6.115 | 0.000000 | 24.979 | 0.0 | 25.754899 | 4 | 0 | 3.0 | Moderate rain |
| 2239 | 2015-10-16 14:00:00 | Moderate rain | 276.763328 | 3.613328 | 271.085195 | 273.237344 | -0.172526 | 2.657227 | 12.095491 | 302.695638 | ... | Northwest | 6.215 | 0.000000 | 25.897 | 0.0 | 26.257896 | 4 | 0 | 3.0 | Moderate rain |
| 2240 | 2015-10-16 15:00:00 | Moderate rain | 276.597296 | 3.447296 | 271.085195 | 272.988734 | -0.166032 | 2.700702 | 12.095491 | 303.377860 | ... | Northwest | 6.315 | 0.000000 | 26.591 | 0.0 | 25.630187 | 4 | 0 | 3.0 | Moderate rain |
| 2241 | 2015-10-16 16:00:00 | Moderate rain | 276.430385 | 3.280385 | 271.085195 | 272.769039 | -0.166911 | 2.743809 | 12.095491 | 304.563206 | ... | Northwest | 6.415 | 0.000000 | 27.208 | 0.0 | 24.500756 | 4 | 0 | 3.0 | Moderate rain |
| 2242 | 2015-10-16 17:00:00 | Moderate rain | 276.242514 | 3.092514 | 271.085195 | 272.543725 | -0.187871 | 2.778891 | 12.095491 | 306.064377 | ... | Northwest | 6.515 | 0.000000 | 27.471 | 0.0 | 25.055266 | 4 | 0 | 3.0 | Moderate rain |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56604 | 2021-12-28 04:00:00 | Mild snowfall | 270.414644 | -2.735356 | 263.582322 | 267.519393 | 0.158137 | 2.467546 | 16.219425 | 214.752551 | ... | Southwest | 2.007 | 53.587745 | 39.138 | 0.0 | 20.613530 | 5 | 0 | 5.0 | Mild snowfall |
| 56605 | 2021-12-28 05:00:00 | Mild snowfall | 270.540634 | -2.609366 | 263.824712 | 267.697892 | 0.125990 | 2.431944 | 16.219425 | 211.899440 | ... | Southwest | 1.807 | 53.587745 | 38.076 | 0.0 | 21.978703 | 5 | 0 | 5.0 | Mild snowfall |
| 56606 | 2021-12-28 06:00:00 | Mild snowfall | 270.638388 | -2.511612 | 264.198556 | 267.826681 | 0.097754 | 2.412873 | 16.219425 | 209.410100 | ... | Southwest | 1.509 | 53.587745 | 32.855 | 0.0 | 23.830981 | 5 | 0 | 5.0 | Mild snowfall |
| 56607 | 2021-12-28 07:00:00 | Mild snowfall | 270.714928 | -2.435072 | 264.198556 | 267.913728 | 0.076540 | 2.405941 | 16.219425 | 206.407704 | ... | Southwest | 1.313 | 53.587745 | 26.111 | 0.0 | 27.455241 | 5 | 0 | 5.0 | Mild snowfall |
| 56608 | 2021-12-28 08:00:00 | Mild snowfall | 270.801398 | -2.348602 | 264.705794 | 268.001076 | 0.086470 | 2.420409 | 16.219425 | 201.798103 | ... | South | 1.113 | 53.587745 | 17.846 | 0.0 | 26.719272 | 5 | 0 | 5.0 | Mild snowfall |
11109 rows × 22 columns
knn_model = KNeighborsClassifier(
n_neighbors=5,
weights='uniform',
algorithm='auto',
leaf_size=30,
p=2,
metric='minkowski',
metric_params=None,
n_jobs=None)
svm_model = svm.SVC(
C=1.0,
kernel='rbf',
degree=3,
gamma='scale',
coef0=0.0,
tol=0.001,
cache_size=800,
class_weight='balanced',
random_state=42)
dtc_model = DecisionTreeClassifier(
criterion="entropy",
class_weight="balanced",
max_depth=10,
min_samples_leaf=3,
max_features=None,
random_state=42)
gbc_model = GradientBoostingClassifier(
loss='log_loss',
learning_rate=0.1,
n_estimators=100,
subsample=1.0,
criterion='friedman_mse',
min_samples_split=2,
min_samples_leaf=3,
min_weight_fraction_leaf=0.0,
max_depth=10,
)
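The hyperparameters above (max_depth, min_samples_leaf, etc.) are set manually. A small grid search is one hedged way such values could be justified; the sketch below tunes the decision tree on synthetic multiclass data standing in for df_extreme_weather:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass stand-in data (not the extreme-weather observations)
X, y = make_classification(n_samples=1500, n_features=8, n_informative=5,
                           n_classes=4, random_state=42)

grid = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", class_weight="balanced",
                           random_state=42),
    param_grid={"max_depth": [5, 10, 20], "min_samples_leaf": [1, 3, 10]},
    cv=3,
)
grid.fit(X, y)

print("best params:", grid.best_params_)
print(f"best CV accuracy: {grid.best_score_:.3f}")
```

The same pattern applies to the KNN, SVC, and gradient boosting configurations, at correspondingly higher compute cost.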
5.4.2.2 Multiclass Classification - Training and Validation
# Visual inspection of the class distribution
wep_counts = df_extreme_weather['wep'].value_counts()
print(wep_counts)
sns.countplot(x = 'wep', data=df_extreme_weather, palette = "Set1")
wep
Mild snowfall                                          3892
Moderate snowfall                                      2483
Moderate rain                                          2322
Heavy snowfall with accumulated snow                    926
Frondurchlauf / Continuous freezing rain                875
Storm with freezing rain / Heavy snow- and icestorm     401
Snowstorm with high precipitation                       210
Name: count, dtype: int64
<Axes: xlabel='wep', ylabel='count'>
Mapping labels and Events:
| label2 | wep |
|---|---|
| 0.0 | Moderate snowfall |
| 1.0 | Heavy snowfall with accumulated snow |
| 2.0 | Storm with freezing rain / Heavy snow- and ice pellets |
| 3.0 | Moderate rain |
| 4.0 | Snowstorm with high precipitation |
| 5.0 | Mild snowfall |
| 6.0 | Frondurchlauf / Continuous freezing rain |
# Defining the dependent and independent variables
X = df_extreme_weather.drop(columns=["label0", "label1", "label2", "wep", "avg_winddir", "wind_direction_label", "avg_temp_celsius", "run_datetime","display_label"])# Features
y = df_extreme_weather['label2'] # Labels
# Split into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('X_train count: ', X_train.shape[0])
print('X_test count: ', X_test.shape[0])
print('Y_train count: ', Y_train.shape[0])
print('Y_test count: ', Y_test.shape[0])
X_train count: 7776 X_test count: 3333 Y_train count: 7776 Y_test count: 3333
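Note that the split above is unstratified; with an imbalanced label distribution such as this one, passing `stratify=y` to `train_test_split` would preserve the class proportions in both partitions. A minimal sketch on synthetic labels (the 90/10 class split is illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 samples of class 0, 10 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify keeps the 9:1 class ratio in both train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print((y_tr == 1).sum(), (y_te == 1).sum())  # class 1: 7 in train, 3 in test
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance, which distorts per-class recall estimates.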
Excursus: Oversampling in Data Imbalance
At this point, applying an oversampling technique such as SMOTE was planned to handle the imbalanced dataset. Oversampling addresses class imbalance by artificially boosting the minority class during model training. While effective, it should be applied judiciously. During iterative prototyping it was discovered that the initial results without oversampling, using the imbalanced data, are satisfactory, and that the models handle the imbalance well on their own. Introducing oversampling at this point would risk overfitting and complicate the models without significant gains. The decision for or against oversampling hinges on a nuanced understanding of the dataset, continual model evaluation, and the potential trade-offs introduced by altering the class distribution; it is a delicate balance between rectifying imbalance and preserving the integrity of the underlying data patterns. For these reasons, oversampling techniques were considered here, but the code was ultimately removed, since the evaluation results are satisfactory without them and overfitting as well as unnecessary model complexity are avoided 1.
Mohammed, R., Rawashdeh, J. and Abdullah, M., 2020, April. Machine learning with oversampling and undersampling techniques: overview study and experimental results. In 2020 11th international conference on information and communication systems (ICICS) (pp. 243-248). IEEE.↩
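For reference, the oversampling idea discussed above can be sketched with plain scikit-learn. The snippet below shows the simpler random-oversampling variant on synthetic data (SMOTE, from the imbalanced-learn package, would additionally interpolate synthetic minority samples rather than duplicating existing ones); all array shapes here are illustrative:

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced training data: 50 majority vs 5 minority samples
X_maj = np.random.RandomState(0).randn(50, 3)
X_min = np.random.RandomState(1).randn(5, 3)

# Randomly oversample the minority class up to the majority count
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_balanced))  # both classes now have 50 samples
```

Because duplicated (or interpolated) minority samples must only ever be created from the training partition, such a step would be applied after the train/test split, never before it.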
5.4.2.3 Multiclass Classification - Fit Model
knn_model_fit = knn_model.fit(X_train, Y_train)
svm_model_fit = svm_model.fit(X_train, Y_train)
dtc_model_fit = dtc_model.fit(X_train, Y_train)
gbc_model_fit = gbc_model.fit(X_train, Y_train)
5.4.2.4 Multiclass Classification - Evaluation on Test Set
# Predict using the trained KNN model on the test set
Y_pred_knn = knn_model_fit.predict(X_test)
# Generate the confusion matrix
cm_knn = confusion_matrix(Y_test, Y_pred_knn)
# Visualize the Confusion Matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Blues')
plt.title('K-Nearest Neighbors Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
# Save the figure before plt.show(), which clears the current figure
plt.savefig('../output/Multiclass Classification KNN Confusion Matrix.png')
plt.show()
# Display the classification report
print("Classification Report:\n",classification_report(Y_test, Y_pred_knn))
precision recall f1-score support
0.0 0.88 0.88 0.88 741
1.0 0.78 0.82 0.80 274
2.0 0.88 0.78 0.82 120
3.0 1.00 1.00 1.00 679
4.0 0.97 0.85 0.90 66
5.0 0.95 0.97 0.96 1190
6.0 0.84 0.80 0.82 263
accuracy 0.92 3333
macro avg 0.90 0.87 0.88 3333
weighted avg 0.92 0.92 0.92 3333
The multiclass classification with K-Nearest Neighbors generally yields good results, with the actual outcomes closely aligning with the predictions.
The classification report provides a comprehensive overview of the model's performance across multiple classes. Precision, representing the accuracy of positive predictions, is generally high, ranging from 78% to 100%. Recall, which measures the ability of the model to capture all relevant instances, shows consistent performance, with values ranging from 78% to 100%. The F1-score, which considers both precision and recall, is strong across most classes, indicating a balanced trade-off between precision and recall. The macro average F1-score is 0.88, suggesting a good overall performance. The weighted average F1-score, considering class imbalance, is also 0.92, demonstrating the model's effectiveness in making accurate predictions across the entire dataset. The high accuracy of 92% further supports the model's reliability in classifying instances from diverse classes. Overall, the model exhibits robust performance across various metrics, showcasing its ability to handle multiclass classification tasks effectively.
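The difference between the macro and weighted averages mentioned above can be made concrete with a toy example: macro averaging treats every class equally, while weighted averaging scales each class's score by its support, so a poorly predicted rare class hurts the macro score far more. A minimal sketch (the labels are invented for illustration):

```python
from sklearn.metrics import f1_score

# Toy multiclass example: class 2 is rare and entirely mispredicted
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

# Macro: unweighted mean over per-class F1 -- the rare class drags it down
macro = f1_score(y_true, y_pred, average='macro')
# Weighted: mean weighted by class support -- dominated by frequent classes
weighted = f1_score(y_true, y_pred, average='weighted')
print(round(macro, 2), round(weighted, 2))  # -> 0.59 0.71
```

This is why the reports in this chapter show the weighted average tracking the overall accuracy closely, while the macro average is more sensitive to the rarer weather event classes.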
# Predict using the trained SVM model on the test set
Y_pred_svm = svm_model_fit.predict(X_test)
# Generate the confusion matrix
cm_svm = confusion_matrix(Y_test, Y_pred_svm)
# Visualize the Confusion Matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Blues')
plt.title('Support Vector Machine Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
# Save the figure before plt.show(), which clears the current figure
plt.savefig('../output/Multiclass Classification SVM Confusion Matrix.png')
plt.show()
# Display the classification report
print("Classification Report:\n",classification_report(Y_test, Y_pred_svm))
precision recall f1-score support
0.0 0.81 0.61 0.69 741
1.0 0.41 0.69 0.52 274
2.0 0.78 0.55 0.64 120
3.0 0.95 1.00 0.98 679
4.0 0.70 0.94 0.80 66
5.0 0.94 0.92 0.93 1190
6.0 0.68 0.67 0.68 263
accuracy 0.82 3333
macro avg 0.75 0.77 0.75 3333
weighted avg 0.84 0.82 0.82 3333
The Support Vector Machine (SVM) exhibits a higher rate of misclassifying events compared to the K-Nearest Neighbors (KNN) algorithm. Specifically, Class 0 experiences a frequent misprediction, often being classified as Class 1 by the SVM. This discrepancy suggests a higher tendency for the SVM to incorrectly assign instances from Class 0 to Class 1, potentially indicating challenges in distinguishing between these two classes. The performance disparity between SVM and KNN highlights the importance of considering the specific characteristics of the dataset and the nature of the classes when selecting an appropriate classification algorithm.
The classification report confirms the weaker performance relative to KNN. For class 0.0, precision drops to 0.81 and recall to 0.61, yielding an F1-score of 0.69 (down from 0.88 with KNN). Class 1.0 suffers from low precision (0.41) despite a recall of 0.69, resulting in an F1-score of only 0.52. Class 2.0 reaches a precision of 0.78 but a recall of just 0.55, for an F1-score of 0.64. Class 4.0 pairs a lower precision (0.70) with a high recall (0.94), giving an F1-score of 0.80. Only classes 3.0 and 5.0 remain strong, with F1-scores of 0.98 and 0.93. The macro-average precision and recall are 0.75 and 0.77, contributing to a macro-average F1-score of 0.75. The weighted average F1-score stands at 0.82 with an accuracy of 82%, both clearly below the corresponding KNN results.
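The class-0-predicted-as-class-1 confusion noted above can be read directly off the confusion matrix, since entry (i, j) counts true-class-i instances predicted as class j. A minimal sketch on invented labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels where true class 0 is often mistaken for class 1
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 0, 1, 1, 2, 2]

cm = confusion_matrix(y_true, y_pred)
# cm[i, j] counts instances of true class i predicted as class j,
# so the off-diagonal entry cm[0, 1] is exactly the "0 mistaken for 1" count
print(cm[0, 1])  # -> 2
```

Inspecting the largest off-diagonal entries of `cm_svm` in this way pinpoints which class pairs the SVM struggles to separate.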
from sklearn.tree import plot_tree
# Make predictions using the Decision Tree Classifier (DTC) model on the test set
Y_pred_dtc = dtc_model_fit.predict(X_test)
# Calculate the confusion matrix
cm_dtc = confusion_matrix(Y_test, Y_pred_dtc)
# Visualize the Confusion Matrix using seaborn heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(cm_dtc, annot=True, fmt='d', cmap='Blues')
plt.title('Decision Tree Classifier Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
# Save the confusion matrix plot before plt.show() clears the figure
plt.savefig('../output/Multiclass Classification DTC Confusion Matrix.png')
plt.show()
# Print the classification report
print("Classification Report:\n",classification_report(Y_test, Y_pred_dtc))
precision recall f1-score support
0.0 0.94 0.92 0.93 741
1.0 0.84 0.93 0.88 274
2.0 0.89 0.91 0.90 120
3.0 1.00 0.99 1.00 679
4.0 0.95 0.94 0.95 66
5.0 0.98 0.97 0.98 1190
6.0 0.92 0.92 0.92 263
accuracy 0.95 3333
macro avg 0.93 0.94 0.94 3333
weighted avg 0.96 0.95 0.95 3333
The Decision Tree Classifier predicts all weather events well, exhibiting commendable performance across multiple metrics. It achieves high precision, recall, and F1-scores for most weather event classes, indicating its effectiveness in making accurate predictions. It performs particularly strongly for classes 3.0, 4.0, and 5.0, with precision and recall values of 0.94 and above. The overall accuracy of 95% underscores the classifier's ability to correctly classify the majority of instances. Compared to the previous models, the Decision Tree Classifier performs well. Its interpretability and simplicity make it a valuable model, though its performance may be surpassed by more sophisticated models in certain scenarios.
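The interpretability mentioned above also extends to the tree's feature importances, which indicate how much each input variable contributes to the splits. A minimal sketch on synthetic stand-in data (the feature names are illustrative, not the actual ERA5 columns):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather features (names are illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=42)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=10,
                              min_samples_leaf=3,
                              random_state=42).fit(X_demo, y_demo)

# Importances sum to 1.0; higher values mark features the tree splits on most
for name, imp in zip(["f0", "f1", "f2", "f3"], tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Applied to `dtc_model_fit` with the real column names of `X`, this gives a quick ranking of which weather parameters drive the classification.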
# Visualize the Decision Tree using plot_tree from sklearn
plt.figure(figsize=(40,20))
fig = plot_tree(dtc_model,
filled = True,
rounded = True,
fontsize = 14)
# Make predictions using the Gradient Boosting Classifier (GBC) model on the test set
Y_pred_gbc = gbc_model_fit.predict(X_test)
# Calculate the confusion matrix
cm_gbc = confusion_matrix(Y_test, Y_pred_gbc)
# Visualize the Confusion Matrix using seaborn heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(cm_gbc, annot=True, fmt='d', cmap='Blues')
plt.title('Gradient Boosting Classifier Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
# Save the confusion matrix plot before plt.show() clears the figure
plt.savefig('../output/Multiclass Classification GBC Confusion Matrix.png')
plt.show()
# Print the classification report
print("Classification Report:\n",classification_report(Y_test, Y_pred_gbc))
precision recall f1-score support
0.0 0.97 0.97 0.97 741
1.0 0.93 0.96 0.94 274
2.0 0.96 0.92 0.94 120
3.0 1.00 1.00 1.00 679
4.0 0.98 0.94 0.96 66
5.0 0.99 0.99 0.99 1190
6.0 0.95 0.95 0.95 263
accuracy 0.98 3333
macro avg 0.97 0.96 0.96 3333
weighted avg 0.98 0.98 0.98 3333
The confusion matrix for the Gradient Boosting Classifier shows very good performance, as most labels are predicted correctly.
The classification report demonstrates remarkable performance across multiple metrics. It achieves high precision, recall, and F1-scores for most weather event classes. Notably, it maintains precision rates of 93% and above for all classes, indicating a low false positive rate. The recall values, reflecting the ability to correctly identify instances of each class, are consistently high, ranging from 92% to 100%. The overall accuracy of 98% further emphasizes the classifier's proficiency in correctly classifying instances. Compared to the preceding models, the Gradient Boosting Classifier stands out with superior accuracy and balanced performance across the weather event classes. Its ability to handle complex relationships within the data and make accurate predictions makes it a robust choice for this classification task.
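One practical way to check whether the 100 boosting rounds configured above are all needed is `staged_predict`, which yields the ensemble's predictions after each round. A minimal sketch on synthetic data (50 rounds here, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=400, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

gbc = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# staged_predict yields test-set predictions after each boosting round,
# letting one see where accuracy plateaus before n_estimators is reached
staged_acc = [accuracy_score(y_te, pred) for pred in gbc.staged_predict(X_te)]
print(len(staged_acc), round(staged_acc[-1], 2))
```

Plotting `staged_acc` against the round number shows how many estimators the real `gbc_model` actually needs, which can shrink the model without sacrificing the reported accuracy.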
5.4.2.5 Multiclass Classification - Save Model
# Create a meaningful name with a timestamp
model_name_knn = f"knn_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_svm = f"svm_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_dtc = f"dtc_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
model_name_gbc = f"gbc_model_fit_{datetime.now().strftime('%Y%m%d_%H%M')}.pkl"
# Specify the folder path
folder_path = "../models/"
# Save the model to the specified folder with the created name
with open(folder_path + model_name_knn, 'wb') as f:
pickle.dump(knn_model_fit, f)
with open(folder_path + model_name_svm, 'wb') as f:
pickle.dump(svm_model_fit, f)
with open(folder_path + model_name_dtc, 'wb') as f:
pickle.dump(dtc_model_fit, f)
with open(folder_path + model_name_gbc, 'wb') as f:
pickle.dump(gbc_model_fit, f)
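To later reuse one of the saved models, the pickle file is simply loaded back and used like the original estimator. A minimal in-memory round-trip sketch (using a tiny toy model instead of the files written above):

```python
import pickle
from io import BytesIO

from sklearn.tree import DecisionTreeClassifier

# Round-trip a small model through pickle in memory
# (stands in for reading back one of the .pkl files saved above)
model = DecisionTreeClassifier(random_state=42).fit(
    [[0], [1], [2], [3]], [0, 0, 1, 1])

buf = BytesIO()
pickle.dump(model, buf)
buf.seek(0)
restored = pickle.load(buf)

# The restored estimator predicts identically to the original
print(restored.predict([[0], [3]]))  # -> [0 1]
```

In practice one would open the saved path with `open(folder_path + model_name_gbc, 'rb')` instead of a `BytesIO` buffer; note that unpickling should only be done on trusted files, ideally with the same scikit-learn version used for training.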
The analysis of the classification reports provides valuable insights into the performance of the different classifiers across the weather event labels. The Support Vector Machine (SVM) misclassifies events most frequently, especially for class 0, which it often predicts as class 1. The K-Nearest Neighbors (KNN) algorithm generally performs well, with actual results closely aligning with predictions. The Gradient Boosting Classifier outshines the other models, exhibiting exceptional precision, recall, and F1-scores for most classes and achieving an impressive overall accuracy of 98%. The Decision Tree Classifier also performs well, with high precision and recall values for most classes, resulting in an accuracy of 95%. Comparatively, the SVM struggles to achieve robust performance across all classes, while the KNN algorithm proves reliable. The Gradient Boosting and Decision Tree classifiers emerge as the top performers, providing accurate predictions across a diverse range of weather event labels.
Generally, the results of the multiclass classification analysis are excellent, proving that extreme weather events can be predicted with very high accuracy using multiclass classification techniques.
In this exploratory data analysis project, titled "Weather Data Analysis: A Regression and Classification Approach on the ERA5 Dataset," an endeavor was made to understand and predict weather patterns in the Bancroft region of Ontario, Canada, utilizing the ERA5 dataset from 2015 to 2022. The project was motivated by the significant role that accurate weather prediction plays in various sectors and the unique climate characteristics of the region, influenced by the lake effect. The complex task of weather prediction was approached through regression and classification analyses, with the aim of modeling weather parameters and predicting weather events, including extreme conditions.
The regression analyses in chapter 5.3.4. aimed to forecast the temperature from historical data. The research question "Is it possible to build an accurate regression model to predict temperature based on historical data?" can be answered with yes. The results indicated that while an exact daily weather forecast remains challenging, satisfactory accuracy was achieved in predicting the general temperature trends over the year using linear regression models. The Support Vector Regressor emerged as the most effective model in this context.
However, predicting the temperature from the wind speed variable in a linear regression analysis (5.3.1.) or from a parameter mix of different variables in a multiple-predictor linear regression analysis (5.3.2.) did not prove successful. The research question "Is it possible to find a correlation or causation between the temperature and windspeed, windgust or winddirection using regression techniques?" can therefore be answered with no. Using the given data, linear regression models cannot predict the temperature from the wind variables or a mix of variables with satisfactory accuracy. The analyses showed that this is due, for one, to the mostly non-linear relationship between the parameters, which proves to be far more complex, and, for another, to correlations between the parameters that were not sufficiently high for a linear regression analysis. For these reasons it was decided to continue the analysis with logistic regression and classification techniques.
Also, the SARIMAX model used for temperature and wind modeling showed a systematic bias, consistently overestimating temperatures. This highlighted the limitations of using SARIMAX for predicting temperature variations based on wind, prompting the need for further refinement and exploration of alternative modeling approaches.
The last regression analysis used logistic regression techniques to classify extreme weather and blue sky day events in a binary classification. The research question was "Can logistic regression effectively classify and predict the occurrence of extreme or normal weather events based on temperature (or alternatively windspeed) ranges?". The approach of predicting extreme weather events based solely on temperature using logistic regression techniques (5.3.5) was insufficient, often misclassifying extreme events as normal. This underscored the necessity of more complex or multivariate approaches to accurately anticipate hazardous weather conditions. Instead of further optimizing the logistic regression approach (e.g., hyperparameter tuning or multiple-predictor regression), it was decided that the more promising approach would be to identify further binary classifiers, as presented in the classification analysis chapters.
In binary classification the goal was to predict whether an observation was an extreme weather or a blue sky day event (5.4.1.). The research question asked was "Is it possible to classify and predict extreme weather events such as storms?". It was identified that extreme weather events can indeed be separated from blue sky day events very accurately, and both classes can be predicted with very high accuracy, precision and recall. "ExtraTreesClassifier", "XGBClassifier", "RandomForestClassifier", and "LGBMClassifier" are the top-performing classifiers based on LazyClassifier's assessment. Each demonstrated high accuracy, with XGBoost slightly leading the pack. These models proved effective in categorizing and predicting weather events from the given data, providing valuable tools for future weather prediction endeavors. The results of this analysis could then be used in multiclass classification to determine the specific type of extreme weather event.
The multiclass classification further nuanced the understanding of various weather events (5.4.2.). The goal was to determine and classify the specific type of extreme weather event, answering the research question "Is it possible to categorize and predict different extreme weather events based on multivariate weather data?". The research question can be answered with yes: the prediction and categorization of various extreme weather events is possible with very high accuracy, precision and recall. Gradient boosting emerged as a particularly potent method, achieving high precision, recall, and F1-scores across all classes. This success illustrates the potential of sophisticated classification algorithms in deciphering complex weather patterns and predicting diverse weather events. This knowledge can then also be used by scientists and governmental institutions in further research, e.g., when it comes to taking countermeasures to prevent damage from certain extreme weather events and to minimize the risks and dangers.
This project's journey through regression and classification analyses of weather data has provided a deeper insight into the atmospheric dynamics of Bancroft, Ontario. While profound analyses and research in temperature trend prediction and weather event classification were conducted, the challenges and inaccuracies encountered remind us of the complexities inherent in meteorological studies. Predicting the weather accurately remains a daunting task, demanding continual refinement of models and methods.
Critically reflecting on the methodology and procedure used, it can be said that after the EDA showed that the wind and temperature variables do not correlate or show association, further regression analyses of these parameters could have been discontinued at that point. But since the scientific literature suggested that such an analysis proves valuable, and other researchers had already made successful approaches with those techniques on this data, the approach was continued in order to further evaluate patterns and characteristics of the temperature and wind parameters and their correlations. Especially the multiple-predictor analysis promised to deliver interesting results, and in the end the research question and hypotheses could successfully be refuted, which in itself is a valuable contribution to science.
Furthermore, the original ERA5 dataset was reduced using PCA and forward feature selection / backward feature elimination to analyze feature relevance, and only a reduced set of parameters was used in the final analyses. In further research, other variables and their association / correlation could be analyzed. For example, the literature suggests that cloud cover has a strong influence on air temperature. Using the same methodology and implementation as in this assignment and simply switching the parameters, additional research with valuable insights into meteorological data can be conducted, yielding better results.
Moreover, the generalizability of the findings has to be scrutinized, as the analyses are based on regional data, and meteorological data is always strongly shaped by regional effects. Conducting the same analysis on data for regions in, e.g., African countries might yield completely different results.
The unpredictability and irregularity of weather and temperature patterns is a crucial aspect that adds complexity to the analysis. Acknowledging the irregular nature of meteorological phenomena emphasizes the inherent challenges in making precise predictions, even with advanced analytical techniques.
While the analyses presented valuable insights, the potential for further optimization, including hyperparameter tuning, remains. Exploring these avenues could enhance the performance of the models and provide more accurate predictions, especially in the context of fine-tuning parameters for machine learning algorithms.
The exploration of weather patterns could be expanded to include a dedicated analysis of trends related to climate change. Understanding the long-term impacts on temperature and weather events could provide valuable insights into the broader environmental context.
Acknowledging that certain factors, such as climate change and omitted variable bias, might not have been explicitly addressed in the analysis, highlights the potential sources of variance and errors. Future research could delve deeper into these factors to refine models and improve predictive accuracy. It's essential to recognize that certain aspects, possibly including external factors or variables not considered in the analysis, might influence the weather and temperature trends. Acknowledging what falls out of the scope of the current study adds a layer of humility to the findings and encourages future researchers to explore additional dimensions.
In summary, while the project has contributed significantly to understanding and predicting weather patterns, there exist additional dimensions and challenges that warrant exploration. The complexities of meteorological studies, coupled with the acknowledgment of unpredictable weather dynamics, call for continual refinement, optimization, and consideration of broader environmental factors for a comprehensive understanding of weather phenomena. Our findings contribute to the broader discourse on weather prediction, emphasizing the need for multidimensional approaches and the potential of machine learning techniques. As climate variability continues to present profound challenges, the insights garnered here offer a stepping stone towards more accurate, reliable, and comprehensive weather forecasting methods. Moving forward, integrating more diverse datasets, refining models, and exploring new methodologies will be crucial in enhancing the predictive capabilities and understanding of weather patterns. This endeavor not only aids in better forecasting but also in strategic planning and preparedness for the diverse impacts of weather and climate change across sectors.
# List the installed packages (use `!pip freeze > requirements.txt` to write the file)
!pip freeze
aiohttp==3.9.1 aiosignal==1.3.1 alembic==1.13.0 altair==5.2.0 annotated-types==0.6.0 anyio==3.7.1 asttokens==2.4.1 attrs==23.1.0 certifi==2023.11.17 charset-normalizer==3.3.2 click==8.1.7 colorama==0.4.6 colorlog==6.8.0 comm==0.2.0 contourpy==1.2.0 cycler==0.12.1 Cython==3.0.6 dataclasses-json==0.6.3 debugpy==1.8.0 decorator==5.1.1 distlib==0.3.7 executing==2.0.1 fastapi==0.104.1 fastjsonschema==2.19.0 filelock==3.13.1 fonttools==4.45.1 frozenlist==1.4.0 greenlet==3.0.1 idna==3.6 imbalanced-learn==0.11.0 imblearn==0.0 ipykernel==6.26.0 ipython==8.17.2 jedi==0.19.1 Jinja2==3.1.2 joblib==1.3.2 jsonpatch==1.33 jsonpointer==2.4 jsonschema==4.20.0 jsonschema-specifications==2023.11.1 jupyter_client==8.6.0 jupyter_core==5.5.0 kiwisolver==1.4.5 langchain==0.0.346 langchain-core==0.0.10 langsmith==0.0.69 lazypredict==0.2.12 lightgbm==4.1.0 Mako==1.3.0 MarkupSafe==2.1.3 marshmallow==3.20.1 matplotlib==3.8.2 matplotlib-inline==0.1.6 multidict==6.0.4 mypy-extensions==1.0.0 nbformat==5.9.2 nest-asyncio==1.5.8 numpy==1.26.2 opensearch-py==2.4.2 optuna==3.4.0 packaging==23.2 pandas==2.1.3 parso==0.8.3 patsy==0.5.4 Pillow==10.1.0 platformdirs==4.0.0 plotly==5.18.0 pmdarima==2.0.4 podman==4.8.0.post1 podman-compose==1.0.6 prompt-toolkit==3.0.40 protobuf==4.25.1 psutil==5.9.6 pure-eval==0.2.2 pyarrow==14.0.1 pydantic==2.5.2 pydantic_core==2.14.5 Pygments==2.16.1 pyparsing==3.1.1 PyPDF2==3.0.1 python-dateutil==2.8.2 python-dotenv==1.0.0 pytz==2023.3.post1 pywin32==306 pyxdg==0.28 PyYAML==6.0.1 pyzmq==25.1.1 referencing==0.31.0 requests==2.31.0 rpds-py==0.13.1 scikit-learn==1.3.2 scipy==1.11.4 seaborn==0.13.0 setuptools==69.0.2 six==1.16.0 skforecast==0.11.0 sniffio==1.3.0 SQLAlchemy==2.0.23 stack-data==0.6.3 starlette==0.27.0 statsmodels==0.14.0 tenacity==8.2.3 threadpoolctl==3.2.0 toolz==0.12.0 tornado==6.3.3 tqdm==4.66.1 traitlets==5.13.0 typing-inspect==0.9.0 typing_extensions==4.8.0 tzdata==2023.3 urllib3==2.1.0 vegafusion==1.4.5 vegafusion-python-embed==1.4.5 
virtualenv==20.25.0 vl-convert-python==1.1.0 wcwidth==0.2.10 xgboost==2.0.2 yarl==1.9.4 yellowbrick==1.5